ABSTRACT: Generalized linear mixed models (GLMMs) are often fit by computational procedures such as penalized quasi-likelihood (PQL). Special cases of GLMMs are generalized linear models (GLMs), which are often fit using algorithms like iterative weighted least squares (IWLS). High computational costs and memory space constraints make it difficult to apply these iterative procedures to datasets having a very large number of records. We propose a computationally efficient strategy based on the Gauss-Seidel algorithm that iteratively fits submodels of the GLMM to collapsed versions of the data. The strategy is applied to investigate the relationship between ischemic heart disease, socioeconomic status, and age/gender category in New South Wales, Australia, based on outcome data consisting of approximately 33 million records. For Poisson and binomial regression models, the Gauss-Seidel approach is found to substantially outperform existing methods in terms of maximum analyzable sample size. Remarkably, for both models, the average time per iteration and the total time until convergence of the Gauss-Seidel procedure are less than 0.3% of the corresponding times for the IWLS algorithm. Platform-independent pseudo-code for fitting GLMMs, as well as the source code used to generate and analyze the datasets in the simulation studies, are available online as supplemental materials.
Journal of Computational and Graphical Statistics 01/2012; 18(4):818-837. DOI:10.1198/jcgs.2009.06127 · 1.22 Impact Factor
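The IWLS baseline that the paper compares against is a standard algorithm, and a minimal sketch may help fix ideas. The following is an illustrative Poisson-regression IWLS in NumPy, not the paper's large-scale Gauss-Seidel implementation; all function and variable names are my own.

```python
import numpy as np

def iwls_poisson(X, y, tol=1e-8, max_iter=50):
    """Fit a Poisson GLM with log link by iterative weighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta                  # linear predictor
        mu = np.exp(eta)                # mean under the log link
        z = eta + (y - mu) / mu         # working response
        W = mu                          # working weights (Var(y) = mu for Poisson)
        XtW = X.T * W
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Simulated check: recover known coefficients from synthetic data
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5000), rng.normal(size=5000)])
y = rng.poisson(np.exp(X @ np.array([0.5, -0.3])))
beta_hat = iwls_poisson(X, y)  # close to [0.5, -0.3]
```

Each pass solves a weighted least-squares problem on the full design matrix, which is exactly the memory bottleneck the Gauss-Seidel strategy avoids by working on collapsed versions of the data.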
ABSTRACT: The independent additive errors linear model consists of a structure for the mean and a separate structure for the error distribution. The error structure may be parametric or it may be semiparametric. Under alternative values of the mean structure, the best fitting additive errors model has an error distribution which can be represented as the convolution of the actual error distribution and the marginal distribution of a misspecification term. The model misspecification term results from the covariates' distribution. Conditions are developed to distinguish when the semiparametric model yields sharper inference than the parametric model and vice versa. The main conditions concern the actual error distribution and the covariates' distribution. The theoretical results explain a paradoxical finding in semiparametric Bayesian modelling, where the posterior distribution under a semiparametric model is found to be more concentrated than is the posterior distribution under a corresponding parametric model. The paradox is illustrated on a set of allometric data. The Canadian Journal of Statistics 39: 165-180; 2011 © 2011 Statistical Society of Canada
Canadian Journal of Statistics 03/2011; 39(1):165 - 180. DOI:10.1002/cjs.10091 · 0.65 Impact Factor
ABSTRACT: Mixture models, or convex combinations of a countable number of probability distributions, offer an elegant framework for inference when the population of interest can be subdivided into latent clusters having random characteristics that are heterogeneous between, but homogeneous within, the clusters. Traditionally, the different kinds of mixture models have been motivated and analyzed from very different perspectives, and their common characteristics have not been fully appreciated. The inferential techniques developed for these models usually necessitate heavy computational burdens that make them difficult, if not impossible, to apply to the massive data sets increasingly encountered in real world studies. This paper introduces a flexible class of models called generalized Pólya urn (GPU) processes. Many common mixture models, including finite mixtures, hidden Markov models, and Dirichlet processes, are obtained as special cases of GPU processes. Other important special cases include finite-dimensional Dirichlet priors, infinite hidden Markov models, analysis of densities models, nested Chinese restaurant processes, hierarchical DP models, nonparametric density models, spatial Dirichlet processes, weighted mixtures of DP priors, and nested Dirichlet processes. An investigation of the theoretical properties of GPU processes offers new insight into asymptotics that form the basis of cost-effective Markov chain Monte Carlo (MCMC) strategies for large datasets. These MCMC techniques have the advantage of providing inferences from the posterior of interest, rather than an approximation, and are applicable to different mixture models. The versatility and impressive gains of the methodology are demonstrated by simulation studies and by a semiparametric Bayesian analysis of high-resolution comparative genomic hybridization data on lung cancer. The appendixes are available online as supplemental material.
Journal of the American Statistical Association 06/2010; 105(490):775-786. DOI:10.1198/jasa.2010.tm09340 · 1.98 Impact Factor
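The classical Pólya urn scheme that the GPU class generalizes can be sketched in a few lines. The following draws Dirichlet-process cluster assignments via the Chinese restaurant process; it is a minimal illustration of the urn mechanism, not the paper's GPU construction, and the parameter names are my own.

```python
import numpy as np

def polya_urn_assignments(n, alpha, rng):
    """Draw cluster labels for n observations from the Polya urn
    (Chinese restaurant process) with concentration parameter alpha."""
    labels = [0]                       # first observation starts cluster 0
    counts = [1]                       # observations per cluster
    for i in range(1, n):
        # join cluster k with prob counts[k]/(i + alpha),
        # or open a new cluster with prob alpha/(i + alpha)
        probs = np.array(counts + [alpha]) / (i + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        labels.append(k)
    return np.array(labels), counts

rng = np.random.default_rng(1)
labels, counts = polya_urn_assignments(1000, alpha=2.0, rng=rng)
```

The rich-get-richer update (probability proportional to current cluster size) is the shared structure that, per the abstract, links finite mixtures, HMMs, and DP-based models within one urn framework.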
ABSTRACT: Genomic alterations have been linked to the development and progression of cancer. The technique of comparative genomic hybridization (CGH) yields data consisting of fluorescence intensity ratios of test and reference DNA samples. The intensity ratios provide information about the number of copies in DNA. Practical issues such as the contamination of tumor cells in tissue specimens and normalization errors necessitate the use of statistics for learning about the genomic alterations from array CGH data. As increasing amounts of array CGH data become available, there is a growing need for automated algorithms for characterizing genomic profiles. Specifically, there is a need for algorithms that can identify gains and losses in the number of copies based on statistical considerations, rather than merely detect trends in the data. We adopt a Bayesian approach, relying on the hidden Markov model to account for the inherent dependence in the intensity ratios. Posterior inferences are made about gains and losses in copy number. Localized amplifications (associated with oncogene mutations) and deletions (associated with mutations of tumor suppressors) are identified using posterior probabilities. Global trends such as extended regions of altered copy number are detected. Because the posterior distribution is analytically intractable, we implement a Metropolis-within-Gibbs algorithm for efficient simulation-based inference. Publicly available data on pancreatic adenocarcinoma, glioblastoma multiforme, and breast cancer are analyzed, and comparisons are made with some widely used algorithms to illustrate the reliability and success of the technique.
Journal of the American Statistical Association 06/2008; 103(482):485-497. DOI:10.1198/016214507000000923 · 1.98 Impact Factor
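The core HMM computation behind posterior copy-number calls can be illustrated with a scaled forward-backward pass. This is a generic Gaussian-emission sketch with made-up states and parameters (loss/neutral/gain means, transition matrix), not the paper's full Bayesian model, which samples the parameters via Metropolis-within-Gibbs.

```python
import numpy as np

def forward_backward(log_ratios, means, sd, trans, init):
    """Posterior state probabilities for a Gaussian-emission HMM.
    States index copy-number categories (e.g. loss, neutral, gain)."""
    n, K = len(log_ratios), len(means)
    # Gaussian emission density for each (observation, state) pair
    emit = np.exp(-0.5 * ((log_ratios[:, None] - means) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    alpha = np.zeros((n, K)); beta = np.zeros((n, K))
    alpha[0] = init * emit[0]; alpha[0] /= alpha[0].sum()
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ trans) * emit[t]
        alpha[t] /= alpha[t].sum()          # scale to avoid underflow
    beta[-1] = 1.0
    for t in range(n - 2, -1, -1):
        beta[t] = trans @ (emit[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)

# Toy profile: neutral probes with a short gained segment in the middle
x = np.concatenate([np.zeros(20), np.full(10, 0.5), np.zeros(20)])
means = np.array([-0.5, 0.0, 0.5])          # loss, neutral, gain
trans = np.full((3, 3), 0.01) + np.eye(3) * 0.97
post = forward_backward(x, means, sd=0.2, trans=trans, init=np.ones(3) / 3)
```

The sticky transition matrix is what lets the model report extended altered regions rather than flagging isolated noisy probes.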
ABSTRACT: Generalized linear mixed models with semiparametric random effects are useful in a wide variety of Bayesian applications. When the random effects arise from a mixture of Dirichlet process (MDP) model with normal base measure, Gibbs sampling algorithms based on the Pólya urn scheme are often used to simulate posterior draws in conjugate models (essentially, linear regression models and models for binary outcomes). In the non-conjugate case, the algorithms proposed by MacEachern and Müller (1998) and Neal (2000) are often applied to generate posterior samples. Some common problems associated with simulation algorithms for non-conjugate models include convergence and mixing difficulties. This paper proposes an algorithm for MDP models with exponential family likelihoods and normal base measures. The algorithm proceeds by making a Laplace approximation to the likelihood function, thereby matching the proposal with that of the Gibbs sampler. The proposal is accepted or rejected via a Metropolis-Hastings step. For conjugate MDP models, the algorithm is identical to the Gibbs sampler. The performance of the technique is investigated using a Poisson regression model with semiparametric random effects. The algorithm performs efficiently and reliably, even in problems where large sample results do not guarantee the success of the Laplace approximation. This is demonstrated by a simulation study where most of the count data consist of small numbers. The technique is associated with substantial benefits relative to existing methods, both in terms of convergence properties and computational cost. The author thanks Professor Steven MacEachern, the anonymous Associate Editor and the anonymous referee for many insightful comments that helped improve the focus of the paper.
Journal of Computational and Graphical Statistics 06/2008; 17(2). DOI:10.1198/106186008X319854 · 1.22 Impact Factor
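The Laplace-approximation proposal can be illustrated on the simplest building block: one Poisson observation with a normal prior on its log rate, which mirrors the paper's Poisson random-effects setting. This is a one-dimensional sketch with hypothetical names, not the paper's full MDP sampler.

```python
import numpy as np

def laplace_mh_step(u_curr, y, m, s2, rng):
    """One Metropolis-Hastings update for a log-rate u with
    y ~ Poisson(exp(u)) and a N(m, s2) prior, proposing from a
    Laplace (normal) approximation to the conditional posterior."""
    def log_post(u):
        return y * u - np.exp(u) - 0.5 * (u - m) ** 2 / s2

    # Newton iterations to locate the posterior mode
    u_hat = m
    for _ in range(50):
        g = y - np.exp(u_hat) - (u_hat - m) / s2      # gradient
        h = -np.exp(u_hat) - 1.0 / s2                 # second derivative (< 0)
        u_hat -= g / h
    prop_var = -1.0 / h                               # curvature at the mode

    def log_q(u):  # proposal log-density (normal centered at the mode)
        return -0.5 * (u - u_hat) ** 2 / prop_var

    u_prop = u_hat + np.sqrt(prop_var) * rng.normal()
    log_acc = (log_post(u_prop) - log_post(u_curr)) + (log_q(u_curr) - log_q(u_prop))
    if np.log(rng.uniform()) < log_acc:
        return u_prop, True
    return u_curr, False

# Short chain for y = 5, prior N(0, 1)
rng = np.random.default_rng(2)
u, draws, accepts = 0.0, [], 0
for _ in range(2000):
    u, ok = laplace_mh_step(u, y=5, m=0.0, s2=1.0, rng=rng)
    draws.append(u); accepts += ok
```

Because the target is nearly log-concave here, the independence proposal tracks the posterior closely and the acceptance rate stays high; the MH correction is what keeps the sampler exact when the Laplace approximation is poor.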
ABSTRACT: The paper is motivated by cure detection among the prostate cancer patients in the National Institutes of Health surveillance epidemiology and end results programme, wherein the main end point (e.g. deaths from prostate cancer) and the censoring causes (e.g. deaths from heart diseases) may be dependent. Although many researchers have studied the mixture survival model to analyse survival data with non-negligible cure fractions, none has studied the mixture cure model in the presence of dependent censoring. To account for such dependence, we propose a more general cure model that allows for dependent censoring. We derive the cure models from the perspective of competing risks and model the dependence between the censoring time and the survival time by using a class of Archimedean copula models. Within this framework, we consider the parameter estimation, the cure detection and the two-sample comparison of latency distributions in the presence of dependent censoring when a proportion of patients is deemed cured. Large sample results by using martingale theory are obtained. We examine the finite sample performance of the proposed methods via simulation and apply them to analyse the surveillance epidemiology and end results prostate cancer data. Copyright 2007 Royal Statistical Society.
Journal of the Royal Statistical Society, Series B 06/2007; 69(3):285-306. DOI:10.1111/j.1467-9868.2007.00589.x · 3.52 Impact Factor
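A small example may clarify how an Archimedean copula induces dependence between two uniform margins, such as the (transformed) survival and censoring times above. The sketch below uses the Clayton copula, one common Archimedean family, sampled by conditional inversion; it is an illustration of the copula mechanism, not the paper's estimation procedure.

```python
import numpy as np

def clayton_sample(n, theta, rng):
    """Draw (u, v) pairs from a Clayton copula, an Archimedean copula
    with generator phi(t) = (t**(-theta) - 1) / theta and
    Kendall's tau = theta / (theta + 2)."""
    u = rng.uniform(size=n)
    w = rng.uniform(size=n)
    # conditional inverse: v given u under the Clayton copula
    v = ((w ** (-theta / (1 + theta)) - 1) * u ** (-theta) + 1) ** (-1 / theta)
    return u, v

# theta = 2 corresponds to Kendall's tau = 0.5
rng = np.random.default_rng(3)
u, v = clayton_sample(2000, theta=2.0, rng=rng)
```

Both coordinates remain marginally uniform while being positively associated; in the cure-model setting, u and v would be fed through the inverse survival functions of the failure and censoring times to produce a dependently censored pair.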
ABSTRACT: Benchmark estimation is motivated by the goal of producing an approximation to a posterior distribution that is better than the empirical distribution function. This is accomplished by incorporating additional information into the construction of the approximation. We focus here on generalized post-stratification, the most successful implementation of benchmark estimation in our experience. We develop generalized post-stratification for settings where the source of the simulation differs from the posterior which is to be approximated. This allows us to use the techniques in settings where it is advantageous to draw from a distribution different than the posterior, whether this is for exploration of the data and/or model, for algorithmic simplicity, to improve convergence of the simulation, or for improved estimation of selected features of the posterior. We develop an asymptotic (in simulation size) theory for the estimators, providing conditions under which central limit theorems hold. The central limit theorems apply both to an importance sampling context and to direct sampling from the posterior distribution. The asymptotic results, coupled with large sample (size of data) approximation results, provide guidance on how to implement generalized post-stratification. The theoretical results also explain the gains associated with generalized post-stratification and the empirically observed robustness to cutpoints for the strata. We note that the results apply well beyond the setting of Markov chain Monte Carlo simulation.
Journal of the American Statistical Association 02/2006; 101(September):1175-1184. DOI:10.1198/016214506000000474 · 1.98 Impact Factor
ABSTRACT: The recently funded Spatial Environmental Epidemiology in New South Wales (SEE NSW) project aims to use routinely collected data in NSW Australia to investigate risk factors for various chronic diseases. In this paper, we present a case study focused on the relationship between social disadvantage and ischemic heart disease to highlight some of the methodological challenges that are likely to arise.
ABSTRACT: While studying various features of the posterior distribution of a vector-valued parameter using an MCMC sample, a subsample is often all that is available for analysis. The goal of benchmark estimation is to use the best available information, i.e. the full MCMC sample, to improve future estimates made on the basis of the subsample. We discuss a simple approach to do this and provide a theoretical basis for the method. The methodology and benefits of benchmark estimation are illustrated using a well-known example from the literature. We obtain as much as an 80% reduction in MSE with the technique based on a 1-in-10 subsample and show that greater benefits accrue with the thinner subsamples that are often used in practice.
Journal of Computational and Graphical Statistics 09/2004; 13(3). DOI:10.1198/106186004X2598 · 1.22 Impact Factor
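The benchmark idea of using the full chain to sharpen subsample-based estimates can be sketched with a simple post-stratified mean. Strata weights come from the full MCMC sample while within-stratum averages come from the 1-in-10 subsample; this is a minimal illustration in the spirit of the paper, with names and stratification choices of my own, not the authors' implementation.

```python
import numpy as np

def benchmark_estimate(full_draws, sub_draws, h, n_strata=10):
    """Post-stratified estimate of E[h(theta)]: strata are quantile bins
    of the full MCMC sample; the subsample supplies within-stratum means,
    the full sample supplies the (benchmark) stratum weights."""
    cuts = np.quantile(full_draws, np.linspace(0, 1, n_strata + 1)[1:-1])
    full_bins = np.digitize(full_draws, cuts)
    sub_bins = np.digitize(sub_draws, cuts)
    est = 0.0
    for k in range(n_strata):
        w = np.mean(full_bins == k)        # stratum weight from the full chain
        in_k = sub_bins == k
        if in_k.any():
            est += w * np.mean(h(sub_draws[in_k]))
    return est

# Toy chain: posterior draws ~ N(0, 1), analysis restricted to a 1-in-10 subsample
rng = np.random.default_rng(4)
full = rng.normal(size=50000)
sub = full[::10]
est = benchmark_estimate(full, sub, h=lambda x: x, n_strata=10)
```

Pinning the stratum frequencies to the full chain removes the between-stratum component of the subsample's sampling error, which is the source of the MSE reductions reported in the abstract.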