Computational Statistics & Data Analysis

Published by Elsevier
Print ISSN: 0167-9473
Genotyping errors that are undetected in genome-wide association studies using single nucleotide polymorphisms (SNPs) may degrade the likelihood of detecting true positive associations. To estimate the frequency of genotyping errors and assess the reproducibility of genotype calls, we analyzed two sets of duplicate data, one dataset containing twenty blind duplicates and another dataset containing twenty-eight non-random duplicates, from a genome-wide association study using Affymetrix GeneChip® 100K Human Mapping Arrays. For the twenty blind duplicates the overall agreement in genotyping calls as measured with the Kappa statistics, was 0.997, with a discordancy rate of 0.27%. For the twenty-eight nonrandom duplicates, the overall agreement was lower, 0.95, with a higher discordancy rate of 4.53%. The accuracy and probability of concordancy were inversely related to the genotyping uncertainty score, i.e., as the genotyping uncertainty score increased, the concordancy and probability of concordant calls decreased. Lowering of the uncertainty score threshold for rejection of genotype calls from the Affymetrix recommended value of 0.25 to 0.20 resulted in an increased predicted accuracy from 92.6% to 95% with a slight increase in the "No Call" rate from 1.81% to 2.33%. Hence, we suggest using a lower uncertainty score threshold, say 0.20, which will result in higher accuracy in calls at a modest decrease in the call rate.
A new comprehensive procedure for statistical analysis of two-dimensional polyacrylamide gel electrophoresis (2D PAGE) images is proposed, including protein region quantification, normalization and statistical analysis. Protein regions are defined by the master watershed map that is obtained from the mean gel. By working with these protein regions, the approach bypasses the current bottleneck in the analysis of 2D PAGE images: it does not require spot matching. Background correction is implemented in each protein region by local segmentation. Two-dimensional locally weighted smoothing (LOESS) is proposed to remove any systematic bias after quantification of protein regions. Proteins are separated into mutually independent sets based on detected correlations, and a multivariate analysis is used on each set to detect the group effect. A strategy for multiple hypothesis testing based on this multivariate approach combined with the usual Benjamini-Hochberg FDR procedure is formulated and applied to the differential analysis of 2D PAGE images. Each step in the analytical protocol is shown by using an actual dataset. The effectiveness of the proposed methodology is shown using simulated gels in comparison with the commercial software packages PDQuest and Dymension. We also introduce a new procedure for simulating gel images.
The antibody microarray is a powerful chip-based technology for profiling hundreds of proteins simultaneously and is used increasingly nowadays. To study humoral response in pancreatic cancers, Patwa et al. (2007) developed a two-dimensional liquid separation technique and built a two-dimensional antibody microarray. However, identifying differential expression regions on the antibody microarray requires the use of appropriate statistical methods to fairly assess the large amounts of data generated. In this paper, we propose a permutation-based test using spatial information of the two-dimensional antibody microarray. By borrowing strength from the neighboring differentially expressed spots, we are able to detect the differential expression region with very high power controlling type I error at 0.05 in our simulation studies. We also apply the proposed methodology to a real microarray dataset.
A new approach to species distribution modelling based on unsupervised classification via a finite mixture of GAMs incorporating habitat suitability curves is proposed. A tailored EM algorithm is outlined for computing maximum likelihood estimates. Several submodels incorporating various parameter constraints are explored. Simulation studies confirm, that under certain constraints, the habitat suitability curves are recovered with good precision. The method is also applied to a set of real data concerning presence/absence of observable small mammal indices collected on the Tibetan plateau. The resulting classification was found to correspond to species-level differences in habitat preference described in previous ecological work.
When comparing sensitivities and specificities from multiple diagnostic tests, particularly in biomedical research, the different test kits under study are applied to groups of subjects with the same disease status for a disease or medical condition under consideration. Although this process gives rise to clustered or correlated test outcomes, the associated inference issues are well recognized and have been widely discussed in the literature. In mental health and psychosocial research, sensitivity and specificity have also been widely used to study the reliability of instrument for diagnosing mental health and psychiatric conditions and assessing certain behavioral patterns. However, unlike biomedical applications, outcomes are often obtained under varying reference standards or different diagnostic criteria, precluding the application of existing methods for comparing multiple diagnostic tests to such a research setting. In this paper, we develop a new approach to address these problems (including that of missing data) by extending recent work on inference using inverse probability weighted estimates. The approach is illustrated with data from two studies in sexual abuse and health research as well as a limited simulation study, with the latter used to study the performance of the proposed procedure.
Breast cancer is the most common non-skin cancer in women and the second most common cause of cancer-related death in U.S. women. It is well known that the breast cancer survival varies by age at diagnosis. For most cancers, the relative survival decreases with age but breast cancer may have the unusual age pattern. In order to reveal the stage risk and age effects pattern, we propose the semiparametric accelerated failure time partial linear model and develop its estimation method based on the P-spline and the rank estimation approach. The simulation studies demonstrate that the proposed method is comparable to the parametric approach when data is not contaminated, and more stable than the parametric methods when data is contaminated. By applying the proposed model and method to the breast cancer data set of Atlantic county, New Jersey from SEER program, we successfully reveal the significant effects of stage, and show that women diagnosed around 38s have consistently higher survival rates than either younger or older women.
Compared to the proportional hazards model and accelerated failure time model, the accelerated hazards model has a unique property in its application, in that it can allow gradual effects of the treatment. However, its application is still very limited, partly due to the complexity of existing semiparametric estimation methods. We propose a new semiparametric estimation method based on the induced smoothing and rank type estimates. The parameter estimates and their variances can be easily obtained from the smoothed estimating equation; thus it is easy to use in practice. Our numerical study shows that the new method is more efficient than the existing methods with respect to its variance estimation and coverage probability. The proposed method is employed to reanalyze a data set from a brain tumor treatment study.
The semiparametric accelerated hazards mixture cure model provides a useful alternative to analyze survival data with a cure fraction if covariates of interest have a gradual effect on the hazard of uncured patients. However, the application of the model may be hindered by the computational intractability of its estimation method due to non-smooth estimating equations involved. We propose a new semiparametric estimation method based on a smooth estimating equation for the model and demonstrate that the new method makes the parameter estimation more tractable without loss of efficiency. The proposed method is used to fit the model to a SEER breast cancer data set.
With three ordinal diagnostic categories, the most commonly used measures for the overall diagnostic accuracy are the volume under the ROC surface (VUS) and partial volume under the ROC surface (PVUS), which are the extensions of the area under the ROC curve (AUC) and partial area under the ROC curve (PAUC), respectively. A gold standard (GS) test on the true disease status is required to estimate the VUS and PVUS. However, oftentimes it may be difficult, inappropriate, or impossible to have a GS because of misclassification error, risk to the subjects or ethical concerns. Therefore, in many medical research studies, the true disease status may remain unobservable. Under the normality assumption, a maximum likelihood (ML) based approach using the expectation-maximization (EM) algorithm for parameter estimation is proposed. Three methods using the concepts of generalized pivot and parametric/nonparametric bootstrap for confidence interval estimation of the difference in paired VUSs and PVUSs without a GS are compared. The coverage probabilities of the investigated approaches are numerically studied. The proposed approaches are then applied to a real data set of 118 subjects from a cohort study in early stage Alzheimer's disease (AD) from the Washington University Knight Alzheimer's Disease Research Center to compare the overall diagnostic accuracy of early stage AD between two different pairs of neuropsychological tests.
In the fields of genomics and high dimensional biology (HDB), massive multiple testing prompts the use of extremely small significance levels. Because tail areas of statistical distributions are needed for hypothesis testing, the accuracy of these areas is important to confidently make scientific judgments. Previous work on accuracy was primarily focused on evaluating professionally written statistical software, like SAS, on the Statistical Reference Datasets (StRD) provided by National Institute of Standards and Technology (NIST) and on the accuracy of tail areas in statistical distributions. The goal of this paper is to provide guidance to investigators, who are developing their own custom scientific software built upon numerical libraries written by others. In specific, we evaluate the accuracy of small tail areas from cumulative distribution functions (CDF) of the Chi-square and t-distribution by comparing several open-source, free, or commercially licensed numerical libraries in Java, C, and R to widely accepted standards of comparison like ELV and DCDFLIB. In our evaluation, the C libraries and R functions are consistently accurate up to six significant digits. Amongst the evaluated Java libraries, Colt is most accurate. These languages and libraries are popular choices among programmers developing scientific software, so the results herein can be useful to programmers in choosing libraries for CDF accuracy.
In many biomedical applications, researchers encounter semicontinuous data where data are either continuous or zero. When the data are collected over time the observations may be correlated. Analysis of this kind of longitudinal semicontinuous data is challenging due to the presence of strong skewness in the data. A flexible class of zero-inflated models in a longitudinal setting is developed. A Bayesian approach is used to analyze longitudinal data from an acupuncture clinical trial, in which the effects of active acupuncture, sham acupuncture and standard medical care is compared on chemotherapy-induced nausea in patients who were treated for advanced breast cancer. A spline model is introduced into the linear predictor of the model to explore the possibility of a nonlinear treatment effect. Possible serial correlation between successive observations is also accounted using the Brownian motion. Thus, the approach taken in this paper provides for a more flexible modeling framework and, with the use of WinBUGS, provides for a computationally simpler approach than direct maximum-likelihood. The Bayesian methodology is illustrated with the acupuncture clinical trial data.
The paper addresses a common problem in the analysis of high-dimensional high-throughput "omics" data, which is parameter estimation across multiple variables in a set of data where the number of variables is much larger than the sample size. Among the problems posed by this type of data are that variable-specific estimators of variances are not reliable and variable-wise tests statistics have low power, both due to a lack of degrees of freedom. In addition, it has been observed in this type of data that the variance increases as a function of the mean. We introduce a non-parametric adaptive regularization procedure that is innovative in that : (i) it employs a novel "similarity statistic"-based clustering technique to generate local-pooled or regularized shrinkage estimators of population parameters, (ii) the regularization is done jointly on population moments, benefiting from C. Stein's result on inadmissibility, which implies that usual sample variance estimator is improved by a shrinkage estimator using information contained in the sample mean. From these joint regularized shrinkage estimators, we derived regularized t-like statistics and show in simulation studies that they offer more statistical power in hypothesis testing than their standard sample counterparts, or regular common value-shrinkage estimators, or when the information contained in the sample mean is simply ignored. Finally, we show that these estimators feature interesting properties of variance stabilization and normalization that can be used for preprocessing high-dimensional multivariate data. The method is available as an R package, called 'MVR' ('Mean-Variance Regularization'), downloadable from the CRAN website.
Linear mixed-effects models involve fixed effects, random effects and covariance structure, which require model selection to simplify a model and to enhance its interpretability and predictability. In this article, we develop, in the context of linear mixed-effects models, the generalized degrees of freedom and an adaptive model selection procedure defined by a data-driven model complexity penalty. Numerically, the procedure performs well against its competitors not only in selecting fixed effects but in selecting random effects and covariance structure as well. Theoretically, asymptotic optimality of the proposed methodology is established over a class of information criteria. The proposed methodology is applied to the BioCycle study, to determine predictors of hormone levels among premenopausal women and to assess variation in hormone levels both between and within women across the menstrual cycle.
Generalized additive models (GAMs) have distinct advantages over generalized linear models as they allow investigators to make inferences about associations between outcomes and predictors without placing parametric restrictions on the associations. The variable of interest is often smoothed using a locally weighted regression (LOESS) and the optimal span (degree of smoothing) can be determined by minimizing the Akaike Information Criterion (AIC). A natural hypothesis when using GAMs is to test whether the smoothing term is necessary or if a simpler model would suffice. The statistic of interest is the difference in deviances between models including and excluding the smoothed term. As approximate chi-square tests of this hypothesis are known to be biased, permutation tests are a reasonable alternative. We compare the type I error rates of the chi-square test and of three permutation test methods using synthetic data generated under the null hypothesis. In each permutation method a distribution of differences in deviances is obtained from 999 permuted datasets and the null hypothesis is rejected if the observed statistic falls in the upper 5% of the distribution. One test is a conditional permutation test using the optimal span size for the observed data; this span size is held constant for all permutations. This test is shown to have an inflated type I error rate. Alternatively, the span size can be fixed a priori such that the span selection technique is not reliant on the observed data. This test is shown to be unbiased; however, the choice of span size is not clear. A third method is an unconditional permutation test where the optimal span size is selected for observed and permuted datasets. This test is unbiased though computationally intensive.
Many diagnostic tools and goodness-of-fit measures, such as the Akaike information criterion (AIC) and the Bayesian deviance information criterion (DIC), are available to evaluate the overall adequacy of linear regression models. In addition, visually assessing adequacy in models has become an essential part of any regression analysis. In this paper, we focus on a spatial consideration of the local DIC measure for model selection and goodness-of-fit evaluation. We use a partitioning of the DIC into the local DIC, leverage, and deviance residuals to assess local model fit and influence for both individual observations and groups of observations in a Bayesian framework. We use visualization of the local DIC and differences in local DIC between models to assist in model selection and to visualize the global and local impacts of adding covariates or model parameters. We demonstrate the utility of the local DIC in assessing model adequacy using HIV prevalence data from pregnant women in the Butare province of Rwanda during 1989-1993 using a range of linear model specifications, from global effects only to spatially varying coefficient models, and a set of covariates related to sexual behavior. Results of applying the diagnostic visualization approach include more refined model selection and greater understanding of the models as applied to the data.
The concept of assumption adequacy averaging is introduced as a technique to develop more robust methods that incorporate assessments of assumption adequacy into the analysis. The concept is illustrated by using it to develop a method that averages results from the t-test and nonparametric rank-sum test with weights obtained from using the Shapiro-Wilk test to test the assumption of normality. Through this averaging process, the proposed method is able to rely more heavily on the statistical test that the data suggests is superior for each individual gene. Subsequently, this method developed by assumption adequacy averaging outperforms its two component methods (the t-test and rank-sum test) in a series of traditional and bootstrap-based simulation studies. The proposed method showed greater concordance in gene selection across two studies of gene expression in acute myeloid leukemia than did the t-test or rank-sum test. An R routine to implement the method is available upon request.
With the advent of powerful computers, simulation studies are becoming an important tool in statistical methodology research. However, computer simulations of a specific process are only as good as our understanding of the underlying mechanisms. An attractive supplement to simulations is the use of plasmode datasets. Plasmodes are data sets that are generated by natural biologic processes, under experimental conditions that allow some aspect of the truth to be known. The benefit of the plasmode approach is that the data are generated through completely natural processes, thus circumventing the common concern of the realism and accuracy of computer simulated data. The estimation of admixture, or the proportion of an individual's genome that originates from different founding populations, is a particularly difficult research endeavor that is well suited to the use of plasmodes. Current methods have been tested with simulations of complex populations where the underlying mechanisms such as the rate and distribution of recombination are not well understood. To demonstrate the utility of this method data derived from mouse crosses is used to evaluate the effectiveness of several admixture estimation methodologies. Each cross shares a common founding population so that the ancestry proportion for each individual is known, allowing for the comparison of true and estimated individual admixture values. Analysis shows that the different estimation methodologies (Structure, AdmixMap and FRAPPE) examined all perform well with simple datasets. However, the performance of the estimation methodologies varied greatly when applied to a plasmode consisting of three founding populations. The results of these examples illustrate the utility of plasmodes in the evaluation of statistical genetics methodologies.
Motivated by recent developments on dimension reduction (DR) techniques for time series data, the association of a general deterrent effect towards South Carolina (SC)'s registration and notification (SORN) policy for preventing sex crimes was examined. Using adult sex crime arrestee data from 1990 to 2005, the the idea of Central Mean Subspace (CMS) is extended to intervention time series analysis (CMS-ITS) to model the sequential intervention effects of 1995 (the year SC's SORN policy was initially implemented) and 1999 (the year the policy was revised to include online notification) on the time series spectrum. The CMS-ITS model estimation was achieved via kernel smoothing techniques, and compared to interrupted auto-regressive integrated time series (ARIMA) models. Simulation studies and application to the real data underscores our model's ability towards achieving parsimony, and to detect intervention effects not earlier determined via traditional ARIMA models. From a public health perspective, findings from this study draw attention to the potential general deterrent effects of SC's SORN policy. These findings are considered in light of the overall body of research on sex crime arrestee registration and notification policies, which remain controversial.
We consider an extension of the temporal epidemic-type aftershock sequence (ETAS) model with random effects as a special case of a well-known doubly stochastic self-exciting point process. The new model arises from a deterministic function that is randomly scaled by a nonnegative random variable, which is unobservable but assumed to follow either positive stable or one-parameter gamma distribution with unit mean. Both random effects models are of interest although the one-parameter gamma random effects model is more popular when modeling associated survival times. Our estimation is based on the maximum likelihood approach with marginalized intensity. The methods are shown to perform well in simulation experiments. When applied to an earthquake sequence on the east coast of Taiwan, the extended model with positive stable random effects provides a better model fit, compared to the original ETAS model and the extended model with one-parameter gamma random effects.
In survival analysis, it is of interest to appropriately select significant predictors. In this paper, we extend the AIC(C) selection procedure of Hurvich and Tsai to survival models to improve the traditional AIC for small sample sizes. A theoretical verification under a special case of the exponential distribution is provided. Simulation studies illustrate that the proposed method substantially outperforms its counterpart: AIC, in small samples, and competes it in moderate and large samples. Two real data sets are also analyzed.
In practical data analysis, nonresponse phenomenon frequently occurs. In this paper, we propose an empirical likelihood based confidence interval for a common mean by combining the imputed data, assuming that data are missing completely at random. Simulation studies show that such confidence intervals perform well, even the missing proportion is high. Our method is applied to an analysis of a real data set from an AIDS clinic trial study.
The two main algorithms that have been considered for fitting constrained marginal models to discrete data, one based on Lagrange multipliers and the other on a regression model, are studied in detail. It is shown that the updates produced by the two methods are identical, but that the Lagrangian method is more efficient in the case of identically distributed observations. A generalization is given of the regression algorithm for modelling the effect of exogenous individual-level covariates, a context in which the use of the Lagrangian algorithm would be infeasible for even moderate sample sizes. An extension of the method to likelihood-based estimation under L 1-penalties is also considered.
Microarray technology has made it possible to investigate expression levels, and more recently methylation signatures, of thousands of genes simultaneously, in a biological sample. Since more and more data from different biological systems or technological platforms are being generated at an incredible rate, there is an increasing need to develop statistical methods that are applicable to multiple data types and platforms. Motivated by such a need, a flexible finite mixture model that is applicable to methylation, gene expression, and potentially data from other biological systems, is proposed. Two major thrusts of this approach are to allow for a variable number of components in the mixture to capture non-biological variation and small biases, and to use a robust procedure for parameter estimation and probe classification. The method was applied to the analysis of methylation signatures of three breast cancer cell lines. It was also tested on three sets of expression microarray data to study its power and type I error rates. Comparison with a number of existing methods in the literature yielded very encouraging results; lower type I error rates and comparable/better power were achieved based on the limited study. Furthermore, the method also leads to more biologically interpretable results for the three breast cancer cell lines.
The analysis of point-level (geostatistical) data has historically been plagued by computational difficulties, owing to the high dimension of the nondiagonal spatial covariance matrices that need to be inverted. This problem is greatly compounded in hierarchical Bayesian settings, since these inversions need to take place at every iteration of the associated Markov chain Monte Carlo (MCMC) algorithm. This paper offers an approach for modeling the spatial correlation at two separate scales. This reduces the computational problem to a collection of lower-dimensional inversions that remain feasible within the MCMC framework. The approach yields full posterior inference for the model parameters of interest, as well as the fitted spatial response surface itself. We illustrate the importance and applicability of our methods using a collection of dense point-referenced breast cancer data collected over the mostly rural northern part of the state of Minnesota. Substantively, we wish to discover whether women who live more than a 60-mile drive from the nearest radiation treatment facility tend to opt for mastectomy over breast conserving surgery (BCS, or "lumpectomy"), which is less disfiguring but requires 6 weeks of follow-up radiation therapy. Our hierarchical multiresolution approach resolves this question while still properly accounting for all sources of spatial association in the data.
According to the American Cancer Society report (1999), cancer surpasses heart disease as the leading cause of death in the United States of America (USA) for people of age less than 85. Thus, medical research in cancer is an important public health interest. Understanding how medical improvements are affecting cancer incidence, mortality and survival is critical for effective cancer control. In this paper, we study the cancer survival trend on the population level cancer data. In particular, we develop a parametric Bayesian joinpoint regression model based on a Poisson distribution for the relative survival. To avoid identifying the cause of death, we only conduct analysis based on the relative survival. The method is further extended to the semiparametric Bayesian joinpoint regression models wherein the parametric distributional assumptions of the joinpoint regression models are relaxed by modeling the distribution of regression slopes using Dirichlet process mixtures. We also consider the effect of adding covariates of interest in the joinpoint model. Three model selection criteria, namely, the conditional predictive ordinate (CPO), the expected predictive deviance (EPD), and the deviance information criteria (DIC), are used to select the number of joinpoints. We analyze the grouped survival data for distant testicular cancer from the Surveillance, Epidemiology, and End Results (SEER) Program using these Bayesian models.
An Approximate Bayesian Bootstrap (ABB) offers advantages in incorporating appropriate uncertainty when imputing missing data, but most implementations of the ABB have lacked the ability to handle nonignorable missing data where the probability of missingness depends on unobserved values. This paper outlines a strategy for using an ABB to multiply impute nonignorable missing data. The method allows the user to draw inferences and perform sensitivity analyses when the missing data mechanism cannot automatically be assumed to be ignorable. Results from imputing missing values in a longitudinal depression treatment trial as well as a simulation study are presented to demonstrate the method's performance. We show that a procedure that uses a different type of ABB for each imputed data set accounts for appropriate uncertainty and provides nominal coverage.
A range of point process models which are commonly used in spatial epidemiology applications for the increased incidence of disease are compared. The models considered vary from approximate methods to an exact method. The approximate methods include the Poisson process model and methods that are based on discretization of the study window. The exact method includes a marked point process model, i.e., the conditional logistic model. Apart from analyzing a real dataset (Lancashire larynx cancer data), a small simulation study is also carried out to examine the ability of these methods to recover known parameter values. The main results are as follows. In estimating the distance effect of larynx cancer incidences from the incinerator, the conditional logistic model and the binomial model for the discretized window perform relatively well. In explaining the spatial heterogeneity, the Poisson model (or the log Gaussian Cox process model) for the discretized window produces the best estimate.
Exact calculations of model posterior probabilities or related quantities are often infeasible due to the analytical intractability of predictive densities. Here new approximations to obtain predictive densities are proposed and contrasted with those based on the Laplace method. Our theory and a numerical study indicate that the proposed methods are easy to implement, computationally efficient, and accurate over a wide range of hyperparameters. In the context of GLMs, we show that they can be employed to facilitate the posterior computation under three general classes of informative priors on regression coefficients. A real example is provided to demonstrate the feasibility and usefulness of the proposed methods in a fully Bayes variable selection procedure.
A linkage study of a qualitative disease endophenotype in a sample of sib pairs, consisting of one disease affected proband and one sibling is considered. The linkage statistic compares marker allele sharing with the proband in siblings with an abnormal endophenotype to siblings with the normal endophenotype. Expressions for the distribution of this linkage statistic, in terms of the recombination fraction are derived and (1) the genetic parameter values (allele frequency and endophenotype and disease penetrance) and (2) the abnormal endophenotype rates in the population and in classes of relatives of disease affected probands. It is then shown that when either the disease or the abnormal endophenotype has additive penetrance, the expressions simplify to a monotonic function of the difference between abnormal endophenotype rates in siblings and in the population. Thought disorder is considered as a putative schizophrenia endophenotype. Forty sets of genetic parameter values that correspond to the known prevalence values for thought disorder in schizophrenic patients, siblings of schizophrenics and the general population are evaluated. For these genetic parameter values, numerical results show that the test statistic has>70% power (α = 0.0001) in general with a sample of 200 or more proband-sibling pairs to detect the linkage between a marker (θ = 0.01), and a locus pleiotropic for schizophrenia and thought disorder.
Three recent nonparametric methodologies for estimating a monotone regression function F and its inverse F (-1) are (1) the inverse kernel method DNP (Dette et al. (2005), Dette and Scheder (2010)), (2) the monotone spline (Kong and Eubank (2006)) and (3) the data adaptive method NAM (Bhattacharya and Lin (2010), (2011)), with roots in isotonic regression (Ayer et al. (1955), Bhattacharya and Kong (2007)). All three have asymptotically optimal error rates. In this article their finite sample performances are compared using extensive simulation from diverse models of interest, and by analysis of real data. Let there be m distinct values of the independent variable x among N observations y. The results show that if m is relatively small compared to N then generally the NAM performs best, while the DNP outperforms the other methods when m is O(N) unless there is a substantial clustering of the values of the independent variable x.
Multivariate extensions of well-known linear mixed-effects models have been increasingly utilized in inference by multiple imputation in the analysis of multilevel incomplete data. The normality assumption for the underlying error terms and random effects plays a crucial role in simulating the posterior predictive distribution from which the multiple imputations are drawn. The plausibility of this normality assumption on the subject-specific random effects is assessed. Specifically, the performance of multiple imputation created under a multivariate linear mixed-effects model is investigated on a diverse set of incomplete data sets simulated under varying distributional characteristics. Under moderate amounts of missing data, the simulation study confirms that the underlying model leads to a well-calibrated procedure with negligible biases and actual coverage rates close to nominal rates in estimates of the regression coefficients. Estimation quality of the random-effect variance and association measures, however, are negatively affected from both the misspecification of the random-effect distribution and number of incompletely-observed variables. Some of the adverse impacts include lower coverage rates and increased biases.
In order to take into account the complex genomic distribution of SNP variations when identifying chromosomal regions with significant SNP effects, a single nucleotide polymorphism (SNP) association scan statistic was developed. To address the computational needs of genome wide association (GWA) studies, a fast Java application, which combines single-locus SNP tests and a scan statistic for identifying chromosomal regions with significant clusters of significant SNP effects, was developed and implemented. To illustrate this application, SNP associations were analyzed in a pharmacogenomic study of the blood pressure lowering effect of thiazide-diuretics (N=195) using the Affymetrix Human Mapping 100K Set. 55,335 tagSNPs (pair-wise linkage disequilibrium R(2)<0.5) were selected to reduce the frequency correlation between SNPs. A typical workstation can complete the whole genome scan including 10,000 permutation tests within 3 hours. The most significant regions locate on chromosome 3, 6, 13 and 16, two of which contain candidate genes that may be involved in the underlying drug response mechanism. The computational performance of ChromoScan-GWA and its scalability were tested with up to 1,000,000 SNPs and up to 4,000 subjects. Using 10,000 permutations, the computation time grew linearly in these datasets. This scan statistic application provides a robust statistical and computational foundation for identifying genomic regions associated with disease and provides a method to compare GWA results even across different platforms.
Fine particulate matter (PM(2.5)) is a mixture of pollutants that has been linked to serious health problems, including premature mortality. Since the chemical composition of PM(2.5) varies across space and time, the association between PM(2.5) and mortality could also change with space and season. In this work we develop and implement a statistical multi-stage Bayesian framework that provides a very broad, flexible approach to studying the spatiotemporal associations between mortality and population exposure to daily PM(2.5) mass, while accounting for different sources of uncertainty. In stage 1, we map ambient PM(2.5) air concentrations using all available monitoring data (IMPROVE and FRM) and an air quality model (CMAQ) at different spatial and temporal scales. In stage 2, we examine the spatial temporal relationships between the health end-points and the exposures to PM(2.5) by introducing a spatial-temporal generalized Poisson regression model. We adjust for time-varying confounders, such as seasonal trends. A common seasonal trends model is to use a fixed number of basis functions to account for these confounders, but the results can be sensitive to the number of basis functions. In this study, the number of the basis functions is treated as an unknown parameter in our Bayesian model and we use a space-time stochastic search variable selection approach. We apply our methods to a data set in North Carolina for the year 2001.
This paper considers model-based methods for estimation of the adjusted attributable risk (AR) in both case-control and cohort studies. An earlier review discussed approaches for both types of studies, using the standard logistic regression model for case-control studies, and for cohort studies proposing the equivalent Poisson model in order to account for the additional variability in estimating the distribution of exposures and covariates from the data. In this paper we revisit case-control studies, arguing for the equivalent Poisson model in this case as well. Using the delta method with the Poisson model, we provide general expressions for the asymptotic variance of the AR for both types of studies. This includes the generalized AR, which extends the original idea of attributable risk to the case where the exposure is not completely eliminated. These variance expressions can be easily programmed in any statistical package that includes Poisson regression and has capabilities for simple matrix algebra. In addition, we discuss computation of standard errors and confidence limits using bootstrap resampling. For cohort studies, use of the bootstrap allows binary regression models with link functions other than the logit.
A broad range of studies of preventive measures in infectious diseases gives rise to incidence data from close contact groups. Parameters of common interest in such studies include transmission probabilities and efficacies of preventive or therapeutic interventions. We estimate these parameters using discrete-time likelihood models. We augment the data with unobserved pairwise transmission outcomes and fit the model using the EM algorithm. A linear model derived from the likelihood based on the augmented data and fitted with the iteratively re-weighted least squares method is also discussed. Using simulations, we demonstrate the comparable accuracy and lower sensitivity to initial estimates of the proposed methods with data augmentation relative to the likelihood model based solely on the observed data. Two randomized household-based trials of zanamivir, an influenza antiviral agent, are analyzed using the proposed methods.
We evaluate the performance of the Dirichlet process mixture (DPM) and the latent class model (LCM) in identifying autism phenotype subgroups based on categorical autism spectrum disorder (ASD) diagnostic features from the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition Text Revision. A simulation study is designed to mimic the diagnostic features in the ASD dataset in order to evaluate the LCM and DPM methods in this context. Likelihood based information criteria and DPM partitioning are used to identify the best fitting models. The Rand statistic is used to compare the performance of the methods in recovering simulated phenotype subgroups. Our results indicate excellent recovery of the simulated subgroup structure for both methods. The LCM performs slightly better than DPM when the correct number of latent subgroups is selected a priori. The DPM method utilizes a maximum a posteriori (MAP) criterion to estimate the number of classes, and yielded results in fair agreement with the LCM method. Comparison of model fit indices in identifying the best fitting LCM showed that adjusted Bayesian information criteria (ABIC) picks the correct number of classes over 90% of the time. Thus, when diagnostic features are categorical and there is some prior information regarding the number of latent classes, LCM in conjunction with ABIC is preferred.
A spatial process observed over a lattice or a set of irregular regions is usually modeled using a conditionally autoregressive (CAR) model. The neighborhoods within a CAR model are generally formed deterministically using the inter-distances or boundaries between the regions. An extension of CAR model is proposed in this article where the selection of the neighborhood depends on unknown parameter(s). This extension is called a Stochastic Neighborhood CAR (SNCAR) model. The resulting model shows flexibility in accurately estimating covariance structures for data generated from a variety of spatial covariance models. Specific examples are illustrated using data generated from some common spatial covariance functions as well as real data concerning radioactive contamination of the soil in Switzerland after the Chernobyl accident.
Many clinical trials compare the efficacy of K (≥3) treatments in repeated measurement studies. However, the design of such trials have received relatively less attention from researchers. Zhang & Ahn (2012) derived a closed-form sample size formula for two-sample comparisons of time-averaged responses using the generalized estimating equation (GEE) approach, which takes into account different correlation structures and missing data patterns. In this paper, we extend the sample size formula to scenarios where K (≥3) treatments are compared simultaneously to detect time-averaged differences in treatment effect. A closed-form sample size formula based on the noncentral χ(2) test statistic is derived. We conduct simulation studies to assess the performance of the proposed sample size formula under various correlation structures from a damped exponential family, random and monotone missing patterns, and different observation probabilities. Simulation studies show that empirical powers and type I errors are close to their nominal levels. The proposed sample size formula is illustrated using a real clinical trial example.
Predictors of random effects are usually based on the popular mixed effects model developed under the assumption that the sample is obtained from a conceptual infinite population; such predictors are employed even when the actual population is finite. Two alternatives that incorporate the finite nature of the population are obtained from the superpopulation model proposed by Scott and Smith (1969, JASA, 64: 830-840) or from the finite population mixed model recently proposed by Stanek and Singer (2004, JASA, 99:1119-1130). Predictors derived under the latter model with the additional assumptions that all variance components are known and that within-cluster variances are equal have smaller mean squared error than the competitors based on either the mixed effects or Scott and Smith's models. As population variances are rarely known, we propose method of moment estimators to obtain empirical predictors and conduct a simulation study to evaluate their performance. The results suggest that the finite population mixed model empirical predictor is more stable than its competitors since, in terms of mean squared error, it is either the best or the second best and when second best, its performance lies within acceptable limits. When both cluster and unit intra-class correlation coefficients are very high (e.g., 0.95 or more), the performance of the empirical predictors derived under the three models is similar.
A new procedure is proposed to balance type I and II errors in significance testing for differential expression of individual genes. Suppose that a collection, F(k), of k lists of selected genes is available, each of them approximating by their content the true set of differentially expressed genes. For example, such sets can be generated by a subsampling counterpart of the delete-d-jackknife method controlling the per-comparison error rate for each subsample. A final list of candidate genes, denoted by S(*), is composed in such a way that its contents be closest in some sense to all the sets thus generated. To measure "closeness" of gene lists, we introduce an asymmetric distance between sets with its asymmetry arising from a generally unequal assignment of the relative costs of type I and type II errors committed in the course of gene selection. The optimal set S(*) is defined as a minimizer of the average asymmetric distance from an arbitrary set S to all sets in the collection F(k). The minimization problem can be solved explicitly, leading to a frequency criterion for the inclusion of each gene in the final set. The proposed method is tested by resampling from real microarray gene expression data with artificially introduced shifts in expression levels of pre-defined genes, thereby mimicking their differential expression.
This paper evaluates the effect of ignoring baseline when modeling transitions from intact cognition to dementia with mild cognitive impairment (MCI) and global impairment (GI) as intervening cognitive states. Transitions among states are modeled by a discrete-time Markov chain having three transient (intact cognition, MCI, and GI) and two competing absorbing states (death and dementia). Transition probabilities depend on two covariates, age and the presence/absence of an apolipoprotein E-epsilon4 allele, through a multinomial logistic model with shared random effects. Results are illustrated with an application to the Nun Study, a cohort of 678 participants 75+ years of age at baseline and followed longitudinally with up to ten cognitive assessments per nun.
Empirical Bayes Hastings sampler estimated frailty distributions from each of the four simulation experiments.
This figure will appear as part of the online supplement to accompany appendix Section 10.3
Studies of ocular disease and analyses of time to disease onset are complicated by the correlation expected between the two eyes from a single patient. We overcome these statistical modeling challenges through a nonparametric Bayesian frailty model. While this model suggests itself as a natural one for such complex data structures, model fitting routines become overwhelmingly complicated and computationally intensive given the nonparametric form assumed for the frailty distribution and baseline hazard function. We consider empirical Bayesian methods to alleviate these difficulties through a routine that iterates between frequentist, data-driven estimation of the cumulative baseline hazard and Markov chain Monte Carlo estimation of the frailty and regression coefficients. We show both in theory and through simulation that this approach yields consistent estimators of the parameters of interest. We then apply the method to the short-wave automated perimetry (SWAP) data set to study risk factors of glaucomatous visual field deficits.
We present a Bayesian variable selection method for the setting in which the number of independent variables or predictors in a particular dataset is much larger than the available sample size. While most existing methods allow some degree of correlations among predictors but do not consider these correlations for variable selection, our method accounts for correlations among the predictors in variable selection. Our correlation-based stochastic search (CBS) method, the hybrid-CBS algorithm, extends a popular search algorithm for high-dimensional data, the stochastic search variable selection (SSVS) method. Similar to SSVS, we search the space of all possible models using variable addition, deletion or swap moves. However, our moves through the model space are designed to accommodate correlations among the variables. We describe our approach for continuous, binary, ordinal, and count outcome data. The impact of choices of prior distributions and hyper-parameters is assessed in simulation studies. We also examined performance of variable selection and prediction as the correlation structure of the predictors varies. We found that the hybrid-CBS resulted in lower prediction errors and better identified the true outcome associated predictors than SSVS when predictors were moderately to highly correlated. We illustrate the method on data from a proteomic profiling study of melanoma, a skin cancer.
Doubly-censored data refers to time to event data for which both the originating and failure times are censored. In studies involving AIDS incubation time or survival after dementia onset, for example, data are frequently doubly-censored because the date of the originating event is interval-censored and the date of the failure event usually is right-censored. The primary interest is in the distribution of elapsed times between the originating and failure events and its relationship to exposures and risk factors. The estimating equation approach [Sun, et al. 1999. Regression analysis of doubly censored failure time data with applications to AIDS studies. Biometrics 55, 909-914] and its extensions assume the same distribution of originating event times for all subjects. This paper demonstrates the importance of utilizing additional covariates to impute originating event times, i.e., more accurate estimation of originating event times may lead to less biased parameter estimates for elapsed time. The Bayesian MCMC method is shown to be a suitable approach for analyzing doubly-censored data and allows a rich class of survival models. The performance of the proposed estimation method is compared to that of other conventional methods through simulations. Two examples, an AIDS cohort study and a population-based dementia study, are used for illustration. Sample code is shown in the appendix.
A Bayesian analysis of stochastic volatility (SV) models using the class of symmetric scale mixtures of normal (SMN) distributions is considered. In the face of non-normality, this provides an appealing robust alternative to the routine use of the normal distribution. Specific distributions examined include the normal, student-t, slash and the variance gamma distributions. Using a Bayesian paradigm, an efficient Markov chain Monte Carlo (MCMC) algorithm is introduced for parameter estimation. Moreover, the mixing parameters obtained as a by-product of the scale mixture representation can be used to identify outliers. The methods developed are applied to analyze daily stock returns data on S&P500 index. Bayesian model selection criteria as well as out-of- sample forecasting results reveal that the SV models based on heavy-tailed SMN distributions provide significant improvement in model fit as well as prediction to the S&P500 index data over the usual normal model.
Enfuvirtide (ENF) is a fusion inhibitor that prevents the entry of HIV virions into target cells. Studying the characteristics of viral evolution during treatment and after a treatment interruption can lend insight into the mechanisms of viral evolution and fitness. Although interruption of anti-HIV therapy often results in rapid emergence of an archived "wild-type" virus population, previous work from our group indicates that when only ENF is interrupted, viral gp41 continues to evolve forward and resistance mutations are lost due to back-mutation and remodeling of the envelope protein. To examine the co-evolution of gp120 and gp41 during ENF interruption we extend the Bayesian Hierarchical Phylogenetic model (HPM). Current HPMs enforce conditional independence across all outcomes while biologically all gene regions within a patient should return the same tree unless recombination confers an evolutionary selective advantage. A two-way-interaction HPM is proposed that provides middle ground between these two extremes and allows us to test for differences in evolutionary pressures across gene regions in multiple patients simultaneously. When the model is applied to a well-characterized cohort of HIV-infected patients interrupting ENF we find that across patients, the virus continued to evolve forward in both gene regions. Overall, the hypothesis of independence over dependence between the gene regions is supported. Models that allow for the examination of co-evolution over time will be increasingly important as more therapeutic classes are developed, each of which may impact other through novel and complex mechanisms.
We model sparse functional data from multiple subjects with a mixed-effects regression spline. In this model, the expected values for any subject (conditioned on the random effects) can be written as the sum of a population curve and a subject-specific deviate from this population curve. The population curve and the subject-specific deviates are both modeled as free-knot b-splines with k and k' knots located at t(k) and t(k'), respectively. To identify the number and location of the "free" knots, we sample from the posterior p (k, t(k), k', t(k')|y) using reversible jump MCMC methods. Sampling from this posterior distribution is complicated, however, by the flexibility we allow for the model's covariance structure. No restrictions (other than positive definiteness) are placed on the covariance parameters ψ and σ(2) and, as a result, no analytical form for the likelihood p (y|k, t(k), k', t(k')) exists. In this paper, we consider two approximations to p(y|k, t(k), k', t(k')) and then sample from the corresponding approximations to p(k, t(k), k', t(k')|y). We also sample from p(k, t(k), k', t(k'), ψ, σ(2)|y) which has a likelihood that is available in closed form. While sampling from this larger posterior is less efficient, the resulting marginal distribution of knots is exact and allows us to evaluate the accuracy of each approximation. We then consider a real data set and explore the difference between p(k, t(k), k', t(k'), ψ, σ(2)|y) and the more accurate approximation to p(k, t(k), k', t(k')|y).
Diarrhoea-associated Haemolytic uraemic syndrome (HUS) is a disease that affects the kidneys and other organs. Motivated by the annual number of cases of HUS collected in Birmingham and Newcastle of England, respectively, from 1970 to 1989, we consider Bayesian changepoint analysis with specific attention to Poisson changepoint models. For changepoint models with unknown number of changepoints, we propose a new non-iterative Bayesian sampling approach (called exact IBF sampling), which completely avoids the problem of convergence and slow convergence associated with iterative Markov chain Monte Carlo (MCMC) methods. The idea is to first utilize the sampling inverse Bayes formula (IBF) to derive the conditional distribution of the latent data given the observed data, and then to draw iid samples from the complete-data posterior distribution. For the purpose of selecting the appropriate model (or determining the number of changepoints), we develop two alternative formulae to exactly calculate marginal likelihood (or Bayes factor) by using the exact IBF output and the point-wise IBF, respectively. The HUS data are re-analyzed using the proposed methods. Simulations are implemented to validate the performance of the proposed methods.
Bayesian models are increasingly used to analyze complex multivariate outcome data. However, diagnostics for such models have not been well-developed. We present a diagnostic method of evaluating the fit of Bayesian models for multivariate data based on posterior predictive model checking (PPMC), a technique in which observed data are compared to replicated data generated from model predictions. Most previous work on PPMC has focused on the use of test quantities that are scalar summaries of the data and parameters. However, scalar summaries are unlikely to capture the rich features of multivariate data. We introduce the use of dissimilarity measures for checking Bayesian models for multivariate outcome data. This method has the advantage of checking the fit of the model to the complete data vectors or vector summaries with reduced dimension, providing a comprehensive picture of model fit. An application with longitudinal binary data illustrates the methods.
The multinomial probit model has emerged as a useful framework for modeling nominal categorical data, but extending such models to multivariate measures presents computational challenges. Following a Bayesian paradigm, we use a Markov chain Monte Carlo (MCMC) method to analyze multivariate nominal measures through multivariate multinomial probit models. As with a univariate version of the model, identification of model parameters requires restrictions on the covariance matrix of the latent variables that are introduced to define the probit specification. To sample the covariance matrix with restrictions within the MCMC procedure, we use a parameter-extended Metropolis-Hastings algorithm that incorporates artificial variance parameters to transform the problem into a set of simpler tasks including sampling an unrestricted covariance matrix. The parameter-extended algorithm also allows for flexible prior distributions on covariance matrices. The prior specification in the method described here generalizes earlier approaches to analyzing univariate nominal data, and the multivariate correlation structure in the method described here generalizes the autoregressive structure proposed in previous multiperiod multinomial probit models. Our methodology is illustrated through a simulated example and an application to a cancer-control study aiming to achieve early detection of breast cancer.
Top-cited authors
Jerome H. Friedman
  • Stanford University
Michel Tenenhaus
Vincenzo Esposito Vinzi
  • ESSEC - Ecole Supérieur des Sciences Economiques et Commerciales
Carlo Lauro
  • University of Naples Federico II
Debasis Kundu
  • Indian Institute of Technology Kanpur