Testing logistic regression coefficients with clustered data and few positive outcomes

Biometric Research Branch, National Cancer Institute, Bethesda, MD 20892, U.S.A.
Statistics in Medicine (Impact Factor: 1.83). 04/2008; 27(8):1305-24. DOI: 10.1002/sim.3011
Source: PubMed


Applications frequently involve logistic regression analysis with clustered data where there are few positive outcomes in some of the independent variable categories. For example, an application is given here that analyzes the association of asthma with various demographic variables and risk factors using data from the Third National Health and Nutrition Examination Survey (NHANES III), a weighted multistage cluster sample. Although there are 742 asthma cases in all (out of 18,395 individuals), one category of one of the independent variables contains only 25 asthma cases (out of 695 individuals). Generalized Wald and score hypothesis tests, which use appropriate cluster-level variance estimators, and a bootstrap hypothesis test have been proposed for testing logistic regression coefficients with cluster samples. Simulations presented in this paper show that, when there are few positive outcomes, these tests can have either inflated or very conservative levels. A simulation-based method is proposed for testing logistic regression coefficients with cluster samples when there are few positive outcomes. This testing methodology is shown to compare favorably with the generalized Wald and score tests and the bootstrap hypothesis test in terms of maintaining nominal levels. The proposed method is also useful when testing goodness-of-fit of logistic regression models using deciles-of-risk tables.
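To make the kind of test discussed above concrete, the following is a minimal sketch of a Wald test for a single logistic regression coefficient that uses a cluster-level (sandwich) variance estimator. It is an illustration only and is not the simulation-based method proposed in the paper. Several simplifications are assumptions made here: the model has one covariate plus an intercept, the data are unweighted and unstratified, no small-sample correction is applied, and the statistic is referred to a chi-square distribution with 1 degree of freedom. The function names `fit_logistic` and `cluster_wald` and the simulated data are hypothetical.

```python
import math
import random


def fit_logistic(x, y, iters=25):
    """Newton-Raphson fit of logit P(y=1) = b0 + b1*x; returns (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            r, w = yi - p, p * (1.0 - p)
            g0 += r            # score for the intercept
            g1 += r * xi       # score for the slope
            h00 += w           # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def cluster_wald(x, y, cluster):
    """Wald test of b1 = 0 with a cluster-level sandwich variance.

    Returns (b1, var_b1, wald, p_value). Assumption: chi-square(1)
    reference distribution, no weights, strata, or corrections.
    """
    b0, b1 = fit_logistic(x, y)
    h00 = h01 = h11 = 0.0
    scores = {}  # per-cluster sums of score contributions
    for xi, yi, c in zip(x, y, cluster):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        w = p * (1.0 - p)
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
        s = scores.setdefault(c, [0.0, 0.0])
        s[0] += yi - p
        s[1] += (yi - p) * xi
    det = h00 * h11 - h01 * h01
    # Row of the inverse information matrix that corresponds to b1.
    a0, a1 = -h01 / det, h00 / det
    # Sandwich variance of b1: sum over clusters of (row . cluster score)^2.
    var_b1 = sum((a0 * u0 + a1 * u1) ** 2 for u0, u1 in scores.values())
    wald = b1 * b1 / var_b1
    p_value = math.erfc(math.sqrt(wald / 2.0))  # chi-square(1) upper tail
    return b1, var_b1, wald, p_value


if __name__ == "__main__":
    # Hypothetical simulated data: 60 clusters of 50, rare outcome,
    # a cluster random effect, and a covariate with no true effect.
    random.seed(0)
    x, y, cluster = [], [], []
    for c in range(60):
        u = random.gauss(0.0, 0.5)
        for _ in range(50):
            p = 1.0 / (1.0 + math.exp(-(-3.0 + u)))
            x.append(1 if random.random() < 0.2 else 0)
            y.append(1 if random.random() < p else 0)
            cluster.append(c)
    print(cluster_wald(x, y, cluster))
```

Summing the score contributions within each cluster before squaring is what distinguishes this variance from the usual model-based one. With few positive outcomes in a covariate category, this variance estimate rests on very few informative observations, which is the situation in which the abstract reports that such tests can depart from their nominal levels.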

  • ABSTRACT: Genetic data collected from the Third National Health and Nutrition Examination Survey (NHANES III) provides an opportunity to investigate associations between genetic variations and health-related phenotypes for the US population. Complex sample designs involving stratified multistage cluster sampling and sample weighting are used to sample families in household surveys such as the NHANES III. We modified conditional likelihood score and trend tests used to test the null hypothesis of no association between a candidate gene and a phenotype in simple random samples of nuclear families so that these tests are applicable to data from complex sample designs. The finite sample properties of our modified test procedures are evaluated via Monte Carlo simulation studies. We recommend using an F-version of the trend test instead of a score test because the F-test shows greater power. Our test statistics are applied to NHANES III data to test for associations between the locus ADRB2 (rs1042713) and obesity, VDR (rs2239185) and high blood lead level, and TGFB1 (rs1982073) and asthma.
    Statistics and Its Interface 01/2014; 7(2):167-176. DOI:10.4310/SII.2014.v7.n2.a2 · 2.93 Impact Factor
  • ABSTRACT: The delete-a-group jackknife is sometimes used when estimating the variances of statistics based on a large sample. We investigate heavily poststratified estimators for a population mean and a simple regression coefficient, where both full-sample and domain estimates are of interest. The delete-a-group (DAG) jackknife employing 30, 60, and 100 replicates is found to be highly unstable, even for large sample sizes. The empirical degrees of freedom of these DAG jackknives are usually much less than their nominal degrees of freedom. This analysis calls into question whether coverage intervals derived from replication-based variance estimators can be trusted for highly calibrated estimates.
    Communications in Statistics - Simulation and Computation 06/2014; 43(10). DOI:10.1080/03610918.2012.762392 · 0.33 Impact Factor