Journal of Statistical Planning and Inference

Published by Elsevier
Online ISSN: 0378-3758
Publications
Article
Toxicologists and pharmacologists often describe the toxicity of a chemical using parameters of a nonlinear regression model. Thus estimation of the parameters of a nonlinear regression model is an important problem. The estimates of the parameters and their uncertainty estimates depend upon the underlying error variance structure in the model. Typically, the researcher would not know a priori whether the error variances are homoscedastic (i.e., constant across dose) or heteroscedastic (i.e., the variance is a function of dose). Motivated by this concern, in this article we introduce an estimation procedure based on a preliminary test that selects an appropriate estimation procedure accounting for the underlying error variance structure. Since outliers and influential observations are common in toxicological data, the proposed methodology uses M-estimators. The asymptotic properties of the preliminary test estimator are investigated; in particular, its asymptotic covariance matrix is derived. The performance of the proposed estimator is compared with several standard estimators using simulation studies. The proposed methodology is also illustrated using a data set obtained from the National Toxicology Program.
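The preliminary-test idea can be illustrated with a simple least-squares stand-in. The sketch below is not the article's M-estimation procedure: the Hill curve, the Breusch-Pagan-style auxiliary regression used as the pretest, and all names are illustrative assumptions.

```python
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit

def hill(dose, emax, ed50):
    # assumed dose-response curve, purely for illustration
    return emax * dose / (ed50 + dose)

def pretest_fit(dose, y, alpha=0.05):
    dose, y = np.asarray(dose, float), np.asarray(y, float)

    # Step 1: fit assuming homoscedastic errors
    theta0, _ = curve_fit(hill, dose, y, p0=[y.max(), np.median(dose)])
    resid = y - hill(dose, *theta0)

    # Step 2: preliminary test for dose-dependent variance
    # (Breusch-Pagan flavour: regress squared residuals on dose)
    X = np.column_stack([np.ones_like(dose), dose])
    coef, *_ = np.linalg.lstsq(X, resid**2, rcond=None)
    fitted_var = X @ coef
    ss_res = np.sum((resid**2 - fitted_var) ** 2)
    ss_tot = np.sum((resid**2 - np.mean(resid**2)) ** 2)
    p_value = stats.chi2.sf(len(dose) * (1 - ss_res / ss_tot), df=1)

    # Step 3: keep the homoscedastic fit, or refit with estimated weights
    if p_value < alpha:
        sigma = np.sqrt(np.clip(fitted_var, 1e-8, None))
        theta1, _ = curve_fit(hill, dose, y, p0=theta0, sigma=sigma)
        return theta1, "heteroscedastic"
    return theta0, "homoscedastic"
```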
 
Article
In settings with three ordinal diagnostic groups, the important measures of diagnostic accuracy are the volume under the surface (VUS) and the partial volume under the surface (PVUS), which are extensions of the area under the curve (AUC) and the partial area under the curve (PAUC). This article addresses confidence interval estimation of the difference in paired VUSs and the difference in paired PVUSs. To focus especially on studies with small to moderate sample sizes, we propose an approach based on the concepts of generalized inference. A Monte Carlo study demonstrates that the proposed approach generally provides confidence intervals with reasonable coverage probabilities even at small sample sizes. The proposed approach is compared to a parametric bootstrap approach and a large sample approach through simulation. Finally, the proposed approach is illustrated via an application to a data set of blood test results of anemia patients.
 
Article
We apply a linear programming approach that uses the causal risk difference ($\mathrm{RD}_C$) as the objective function and provides the minimum and maximum values that $\mathrm{RD}_C$ can achieve under any set of linear constraints on the potential response type distribution. We consider two scenarios involving a binary exposure X, covariate Z and outcome Y. In the first, Z is not affected by X and is a potential confounder of the causal effect of X on Y. In the second, Z is affected by X and is intermediate in the causal pathway between X and Y. For each scenario we consider various linear constraints corresponding to the presence or absence of arcs in the associated directed acyclic graph (DAG), monotonicity assumptions, and the presence or absence of additive-scale interactions. We also estimate Z-stratum-specific bounds when Z is a potential effect measure modifier, and bounds for both controlled and natural direct effects when Z is affected by X. In the absence of any additional constraints derived from background knowledge, the well-known bounds on $\mathrm{RD}_C$ are reproduced: $-\Pr(Y \neq X) \le \mathrm{RD}_C \le \Pr(Y = X)$. These bounds have unit width, but can be narrowed by assumptions based on background knowledge. We provide and compare bounds and bound widths for various combinations of assumptions in the two scenarios and apply these bounds to real data from two studies.
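As a quick illustration of the assumption-free case, the sketch below computes these bounds directly from the joint distribution of X and Y; the numbers are hypothetical and not from the cited studies.

```python
# Assumption-free bounds on the causal risk difference from the observed
# joint distribution of a binary exposure X and outcome Y.
def rd_bounds(p11, p10, p01, p00):
    """p_xy = Pr(X=x, Y=y); returns (lower, upper) for RD_C."""
    lower = -(p10 + p01)   # -Pr(Y != X)
    upper = p11 + p00      #  Pr(Y  = X)
    return lower, upper

# Example: Pr(X=1,Y=1)=0.30, Pr(X=1,Y=0)=0.20, Pr(X=0,Y=1)=0.15, Pr(X=0,Y=0)=0.35
print(rd_bounds(0.30, 0.20, 0.15, 0.35))   # (-0.35, 0.65): width 1, as expected
```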
 
Article
The problem of selecting the correct subset of predictors within a linear model has received much attention in the recent literature. Within the Bayesian framework, a popular choice of prior has been Zellner's g-prior, which is based on the inverse of the empirical covariance matrix of the predictors. This article proposes an extension of Zellner's prior that allows for a power parameter on the empirical covariance of the predictors. The power parameter helps control the degree to which correlated predictors are smoothed towards or away from one another. In addition, the empirical covariance of the predictors is used to obtain suitable priors over model space. In this manner, the power parameter also helps determine whether models containing highly collinear predictors are preferred or avoided. The power parameter can be chosen via an empirical Bayes method, which leads to a data-adaptive choice of prior. Simulation studies and a real data example show how the power parameter is well determined by the degree of cross-correlation within the predictors. The proposed modification compares favorably to the standard use of Zellner's prior and an intrinsic prior in these examples.
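A minimal sketch of one natural reading of such a power extension is given below; the exact parameterization used in the article may differ, so the placement of the power parameter $\lambda$ here is an assumption.

```latex
% Standard Zellner g-prior for the regression coefficients:
\[
  \beta \mid \sigma^2 \;\sim\; N_p\!\bigl(0,\; g\,\sigma^2 (X^\top X)^{-1}\bigr).
\]
% One way to introduce a power parameter \lambda on the empirical covariance:
\[
  \beta \mid \sigma^2, \lambda \;\sim\; N_p\!\bigl(0,\; g\,\sigma^2 (X^\top X)^{-\lambda}\bigr),
  \qquad \lambda \ge 0,
\]
% so that \lambda = 1 recovers Zellner's prior, \lambda = 0 gives a ridge-type
% exchangeable prior, and intermediate values control how strongly correlated
% predictors are smoothed towards or away from one another.
```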
 
Article
In clinical trials, several competing treatments are often carried out in the same trial period. The goal is to assess the performance of these treatments according to some optimality criterion and to minimize risks to the patients over the entire course of the study. For this, each incoming patient is allocated sequentially to one of the treatments according to a mechanism defined by the optimality criterion. In practice, different optimality criteria, or the same criterion with different regimes, sometimes need to be considered to assess the treatments in the same study, so that each mechanism is also evaluated during the trial. In this case, the question is how to allocate the treatments to the incoming patients so that the criteria/mechanisms of interest are assessed during the trial, and the overall performance of the trial is optimized under the combined criteria or regimes. In this paper, we consider this problem by investigating a compound adaptive generalized Pólya urn design. Basic asymptotic properties of this design are also studied.
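As background for the urn mechanism, the sketch below simulates one classical member of the generalized Pólya urn family (a randomized play-the-winner rule). The compound design studied in the paper combines several such mechanisms and is not reproduced here; the response probabilities are hypothetical.

```python
import numpy as np

def rpw_urn_trial(n_patients, n_arms, success_prob, seed=0):
    """Randomized play-the-winner allocation: draw an arm with probability
    proportional to its ball count, then update the urn from the response."""
    rng = np.random.default_rng(seed)
    urn = np.ones(n_arms)                 # initial urn composition
    counts = np.zeros(n_arms, dtype=int)
    for _ in range(n_patients):
        k = rng.choice(n_arms, p=urn / urn.sum())
        counts[k] += 1
        if rng.random() < success_prob[k]:
            urn[k] += 1.0                 # reward the successful arm
        else:
            urn[np.arange(n_arms) != k] += 1.0 / (n_arms - 1)  # spread to the others
    return counts

# toy run: allocation drifts toward the better-performing arms
print(rpw_urn_trial(300, 3, success_prob=[0.3, 0.5, 0.7]))
```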
 
Article
We consider the problem of estimating the minimum effective and peak doses in the presence of covariates. We propose a sequential strategy for subject assignment that includes an adaptive randomization component to balance the allocation to placebo and active doses with respect to covariates. We conclude that either adjusting for covariates in the model or balancing allocation with respect to covariates is required to avoid bias in the target dose estimation. We also compute the optimal allocation for estimating the minimum effective and peak doses in a discrete dose space using isotonic regression.
 
Article
Covariate adjusted regression (CAR) is a recently proposed adjustment method for regression analysis where both the response and predictors are not directly observed (Şentürk and Müller, 2005). The available data have been distorted by unknown functions of an observable confounding covariate. CAR provides consistent estimators for the coefficients of the regression between the variables of interest, adjusted for the confounder. We develop a broader class of partial covariate adjusted regression (PCAR) models to accommodate both distorted and undistorted (adjusted/unadjusted) predictors. The PCAR model allows for unadjusted predictors, such as age, gender and demographic variables, which are common in the analysis of biomedical and epidemiological data. The available estimation and inference procedures for CAR are shown to be invalid for the proposed PCAR model. We propose new estimators and develop new inference tools for the more general PCAR setting. In particular, we establish the asymptotic normality of the proposed estimators and propose consistent estimators of their asymptotic variances. Finite sample properties of the proposed estimators are investigated using simulation studies, and the method is also illustrated with a Pima Indians diabetes data set.
 
Article
We consider the multiple comparison problem where multiple outcomes are each compared among several different collections of groups in a multiple group setting. In this case there are several different types of hypotheses, with each specifying equality of the distributions of a single outcome over a different collection of groups. Each type of hypothesis requires a different permutational approach. We show that under a certain multivariate condition it is possible to use closure over all hypotheses, although intersection hypotheses are tested using Boole's inequality in conjunction with permutation distributions in some cases. Shortcut tests are then found so that the resulting testing procedure is easily performed. The error rate and power of the new method are compared to existing competitors through simulation of correlated data. An example is analyzed, consisting of multiple adverse events in a clinical trial.
 
Article
Two statistical scoring procedures based on p-values have been developed to evaluate the overall performance of analytical laboratories performing environmental measurements. The overall score of bias and standing are used to determine how consistently a laboratory is able to measure the true (unknown) value correctly over time. The overall score of precision and standing are used to determine how well a laboratory is able to reproduce its measurements in the long run. Criteria are established for qualitatively labeling measurements as Acceptable, Warning, and Not Acceptable, and for identifying areas where laboratories should re-evaluate their measurement procedures. These statistical scoring procedures are applied to two real environmental data sets.
 
Article
Open label and single blinded randomized controlled clinical trials are vulnerable to selection bias when the next treatment assignment is predictable from the randomization algorithm and the preceding assignment history. While treatment predictability is an issue for all constrained randomization algorithms, deterministic assignments are unique to permuted block randomization. Deterministic assignments may lead to treatment predictability with certainty and selection bias, which could inflate the type I error and hurt the validity of trial results. It is therefore important to accurately evaluate the probability of deterministic assignments in permuted block randomization so that proper protection measures can be implemented. For trials with $T = 2$ treatment arms and a balanced block size $B = 2m$, Matts and Lachin indicated that the probability of a deterministic assignment is $\frac{1}{m+1}$. For more general situations, with $T \ge 2$ and a block size $B = \sum_{j=1}^{T} m_j$, Dupin-Spriet provided a formula, which can be written as $\frac{1}{B}\sum_{j=1}^{T}\sum_{i=1}^{m_j}\prod_{k=1}^{i}\frac{m_j-k+1}{B-k+1}$. This formula involves extensive calculation. In this paper, we simplify it to $\frac{1}{B}\sum_{j=1}^{T}\frac{m_j}{B-m_j+1}$ for general scenarios and to $\frac{1}{B-m+1}$ for trials with a balanced allocation ($m_j = m$ for all $j$). Through mathematical induction we show the equivalence of the formulas. While the new formula is numerically equivalent to Dupin-Spriet's formula, its simple form is not only easier to evaluate but also clearer in describing the impact of the parameters $T$ and $m_j$ on the probability of deterministic assignments.
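The claimed equivalence is easy to confirm numerically; the sketch below evaluates both expressions for a few hypothetical block configurations.

```python
from math import prod

def dupin_spriet(m):
    """Dupin-Spriet's expression; m = (m_1, ..., m_T), block size B = sum(m)."""
    B = sum(m)
    return sum(
        prod((mj - k + 1) / (B - k + 1) for k in range(1, i + 1))
        for mj in m for i in range(1, mj + 1)
    ) / B

def simplified(m):
    """The simplified expression (1/B) * sum_j m_j / (B - m_j + 1)."""
    B = sum(m)
    return sum(mj / (B - mj + 1) for mj in m) / B

for m in [(2, 2), (3, 3), (2, 3, 4), (1, 5)]:
    assert abs(dupin_spriet(m) - simplified(m)) < 1e-12
    print(m, round(simplified(m), 4))
# balanced two-arm case (m, m): both reduce to 1/(m + 1), e.g. 1/3 for m = 2
```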
 
Article
In many diagnostic studies, multiple diagnostic tests are performed on each subject or multiple disease markers are available. Commonly, this information is combined to improve diagnostic accuracy. We consider the problem of comparing the discriminatory abilities of two groups of biomarkers. Specifically, this article focuses on confidence interval estimation of the difference between paired AUCs based on optimally combined markers under the assumption of multivariate normality. Simulation studies demonstrate that the proposed generalized variable approach provides confidence intervals with satisfactory coverage probabilities at finite sample sizes. The proposed method can also easily provide p-values for hypothesis testing. Application to a subset of data from a study on coronary heart disease illustrates the utility of the method in practice.
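For context, under multivariate normality the linear combination of markers that maximizes the AUC is the classical Su and Liu (1993) combination; the sketch below computes that combination and its AUC from assumed group means and covariances. Whether the article uses exactly this combination is an assumption here, and all numbers are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def best_linear_combination(mu_d, mu_nd, cov_d, cov_nd):
    """Su-Liu optimal direction (up to scale) and the AUC it attains."""
    delta = np.asarray(mu_d, float) - np.asarray(mu_nd, float)
    pooled = np.asarray(cov_d, float) + np.asarray(cov_nd, float)
    a = np.linalg.solve(pooled, delta)                       # combining weights
    auc = norm.cdf(np.sqrt(delta @ np.linalg.solve(pooled, delta)))
    return a, auc

a, auc = best_linear_combination([1.0, 0.8], [0.0, 0.0], np.eye(2), np.eye(2))
print(np.round(a, 3), round(auc, 3))
```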
 
Article
In the literature on change-point analysis, much attention has been paid to detecting changes in certain marginal characteristics, such as mean, variance, and marginal distribution. For time series data with nonparametric time trend, we study the change-point problem for the autocovariance structure of the unobservable error process. To derive the asymptotic distribution of the cumulative sum test statistic, we develop substantial theory for uniform convergence of weighted partial sums and weighted quadratic forms. Our asymptotic results improve upon existing works in several important aspects. The performance of the test statistic is examined through simulations and an application to interest rates data.
 
Article
We develop in this paper a new procedure to construct simultaneous confidence bands for derivatives of mean curves in functional data analysis. The technique involves polynomial splines that provide an approximation to the derivatives of the mean functions, the covariance functions and the associated eigenfunctions. We show that the proposed procedure has desirable statistical properties. In particular, we first show that the proposed estimators of derivatives of the mean curves are semiparametrically efficient. Second, we establish consistency results for derivatives of covariance functions and their eigenfunctions. Most importantly, we show that the proposed spline confidence bands are asymptotically efficient as if all random trajectories were observed with no error. Finally, the confidence band procedure is illustrated through numerical simulation studies and a real life example.
 
Article
Bayes methodology provides posterior distribution functions based on parametric likelihoods adjusted for prior distributions. A distribution-free alternative to the parametric likelihood is the use of empirical likelihood (EL) techniques, well known in the context of nonparametric testing of statistical hypotheses. Empirical likelihoods have been shown to exhibit many of the properties of conventional parametric likelihoods. In this article, we propose and examine Bayes factor (BF) methods that are derived via the EL ratio approach. Following Kass and Wasserman [10], we consider BF-type decision rules in the context of standard statistical testing techniques. We show that the asymptotic properties of the proposed procedure are similar to the classical BF's asymptotic operating characteristics. Although we focus on hypothesis testing, the proposed approach also yields confidence interval estimators of unknown parameters. Monte Carlo simulations were conducted to evaluate the theoretical results as well as to demonstrate the power of the proposed test.
 
Article
The study of HIV dynamics is one of the most important developments in recent AIDS research for understanding the pathogenesis of HIV-1 infection and antiviral treatment strategies. Currently, a large number of AIDS clinical trials on HIV dynamics are in development worldwide. However, many design issues that arise in AIDS clinical trials have not been addressed. In this paper, we use a simulation-based approach to deal with design problems in Bayesian hierarchical nonlinear (mixed-effects) models. The underlying model characterizes long-term viral dynamics under antiretroviral treatment, where we directly incorporate drug susceptibility and exposure into a function of treatment efficacy. The Bayesian design method is investigated within the framework of hierarchical Bayesian (mixed-effects) models. We numerically compare a finite number of feasible candidate designs currently used in AIDS clinical trials from different perspectives, and provide guidance on how a design might be chosen in practice.
 
Article
Ratio estimators of effect are ordinarily obtained by exponentiating maximum-likelihood estimators (MLEs) of log-linear or logistic regression coefficients. These estimators can display marked positive finite-sample bias, however. We propose a simple correction that removes a substantial portion of the bias due to exponentiation. By combining this correction with bias correction on the log scale, we demonstrate that one achieves complete removal of second-order bias in odds ratio estimators in important special cases. We show how this approach extends to address bias in odds or risk ratio estimators in many common regression settings. We also propose a class of estimators that provide reduced mean bias and squared error, while allowing the investigator to control the risk of underestimating the true ratio parameter. We present simulation studies in which the proposed estimators are shown to exhibit considerable reduction in bias, variance, and mean squared error compared to MLEs. Bootstrapping provides further improvement, including narrower confidence intervals without sacrificing coverage.
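The source of the exponentiation bias can be sketched with a second-order argument; the correction shown below is of the general flavor described above, but the exact form used in the article may differ.

```latex
% If \hat\beta is approximately N(\beta, \sigma^2), a Taylor/lognormal argument
% gives
\[
  E\bigl[\exp(\hat\beta)\bigr] \;\approx\; \exp\!\bigl(\beta + \tfrac{1}{2}\sigma^2\bigr)
  \;=\; \exp(\beta)\bigl(1 + \tfrac{1}{2}\sigma^2 + \cdots\bigr),
\]
% so even an unbiased log-scale estimator is biased upward after
% exponentiation. A correction of the form
\[
  \widehat{\mathrm{OR}} \;=\; \exp\!\bigl(\hat\beta - \tfrac{1}{2}\hat\sigma^2\bigr)
\]
% removes the leading term of that bias (with \hat\sigma^2 the estimated
% variance of \hat\beta).
```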
 
Article
Many analyses for incomplete longitudinal data are directed to examining the impact of covariates on the marginal mean responses. We consider the setting in which longitudinal responses are collected from individuals nested within clusters. We discuss methods for assessing covariate effects on the mean and association parameters when covariates are incompletely observed. Weighted first and second order estimating equations are constructed to obtain consistent estimates of mean and association parameters when covariates are missing at random. Empirical studies demonstrate that estimators from the proposed method have negligible finite sample biases in moderate samples. An application to the National Alzheimer's Coordinating Center (NACC) Uniform Data Set (UDS) demonstrates the utility of the proposed method.
 
Article
Many syndromes traditionally viewed as individual diseases are heterogeneous in molecular pathogenesis and treatment responsiveness. This often leads to the conduct of large clinical trials to identify small average treatment benefits for heterogeneous groups of patients. Drugs that demonstrate effectiveness in such trials may subsequently be used broadly, resulting in ineffective treatment of many patients. New genomic and proteomic technologies provide powerful tools for the selection of patients likely to benefit from a therapeutic without unacceptable adverse events. In spite of the large literature on developing predictive biomarkers, there is considerable confusion about the development and validation of biomarker based diagnostic classifiers for treatment selection. In this paper we attempt to clarify some of these issues and to provide guidance on the design of clinical trials for evaluating the clinical utility and robustness of pharmacogenomic classifiers.
 
Article
We consider asymptotic properties of the maximum likelihood and related estimators in a clustered logistic joinpoint model with an unknown joinpoint. Sufficient conditions are given for the consistency of confidence bounds produced by the parametric bootstrap; one of the conditions required is that the true location of the joinpoint is not at one of the observation times. A simulation study is presented to illustrate the lack of consistency of the bootstrap confidence bounds when the joinpoint is an observation time. A removal algorithm is presented which corrects this problem, but at the price of an increased mean square error. Finally, the methods are applied to data on yearly cancer mortality in the United States for individuals age 65 and over.
 
Article
Studies of diagnostic tests are often designed with the goal of estimating the area under the receiver operating characteristic curve (AUC), because the AUC is a natural summary of a test's overall diagnostic ability. However, sample size projections dealing with AUCs are very sensitive to assumptions about the variance of the empirical AUC estimator, which depends on two correlation parameters. While these correlation parameters can be estimated from available data, in practice it is hard to find reliable estimates before the study is conducted. Here we derive achievable bounds on the projected sample size that are free of these two correlation parameters. The lower bound is the smallest sample size that would yield the desired level of precision for some model, while the upper bound is the smallest sample size that would yield the desired level of precision for all models. These bounds are important reference points when designing a single- or multi-arm study; they are the absolute minimum and maximum sample sizes that would ever be required. When the study design includes multiple readers or interpreters of the test, we derive bounds pertaining to the average reader AUC and the 'pooled' or overall AUC for the population of readers. These upper bounds for multireader studies are not too conservative when several readers are involved.
 
Article
Sequential designs can be used to save computation time when implementing Monte Carlo hypothesis tests. The motivation is to stop resampling if the early resamples provide enough information about the significance of the p-value of the original Monte Carlo test. In this paper, we consider a sequential design called the B-value design, proposed by Lan and Wittes, and construct the sequential design to bound the resampling risk, the probability that the accept/reject decision differs from the decision based on complete enumeration. For the B-value design, whose exact implementation can be carried out using the algorithm proposed in Fay, Kim and Hachey, we first compare the expected resample size for different designs with comparable resampling risk. We show that the B-value design offers considerable savings in expected resample size compared to a fixed resample or simple curtailed design, and a comparable expected resample size to the iterative push out design of Fay and Follmann. The B-value design is more practical than the iterative push out design in that it is tractable even for small values of the resampling risk, which was a challenge for the iterative push out design. We also propose an approximate B-value design that can be constructed without specially developed software and that provides analytic insight into the choice of parameter values in constructing the exact B-value design.
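For orientation, the simple curtailed design used as a comparator above stops as soon as the accept/reject decision can no longer change; a minimal sketch is given below. The B-value design itself is a different, more refined stopping rule, and the resampling statistic here is a toy stand-in.

```python
import numpy as np

def curtailed_mc_test(t_obs, resample_stat, n_max=999, alpha=0.05, seed=0):
    """Curtailed Monte Carlo test: reject iff fewer than ceil(alpha*(n_max+1))
    resampled statistics reach t_obs, which matches the usual p-hat =
    (1 + count)/(n_max + 1) <= alpha rule when alpha*(n_max+1) is an integer
    (e.g. n_max = 999, alpha = 0.05)."""
    rng = np.random.default_rng(seed)
    threshold = int(np.ceil(alpha * (n_max + 1)))
    exceed = 0
    for n in range(1, n_max + 1):
        exceed += resample_stat(rng) >= t_obs
        if exceed >= threshold:                 # rejection is now impossible
            return "do not reject", n
        if exceed + (n_max - n) < threshold:    # rejection is already certain
            return "reject", n
    return ("reject" if exceed < threshold else "do not reject"), n_max

# toy usage with a hypothetical null resampling distribution
print(curtailed_mc_test(2.3, lambda rng: rng.standard_normal()))
```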
 
Article
We are concerned with the problem of estimating the treatment effects at the effective doses in a dose-finding study. Under monotone dose-response, the effective doses can be identified through the estimation of the minimum effective dose, for which there is an extensive set of statistical tools. In particular, when a fixed-sequence multiple testing procedure is used to estimate the minimum effective dose, Hsu and Berger (1999) show that the confidence lower bounds for the treatment effects can be constructed without the need to adjust for multiplicity. Their method, called the dose-response method, is simple to use, but does not account for the magnitude of the observed treatment effects. As a result, the dose-response method will estimate the treatment effects at effective doses with confidence bounds invariably identical to the hypothesized value. In this paper, we propose an error-splitting method as a variant of the dose-response method to construct confidence bounds at the identified effective doses after a fixed-sequence multiple testing procedure. Our proposed method has the virtue of simplicity as in the dose-response method, preserves the nominal coverage probability, and provides sharper bounds than the dose-response method in most cases.
 
Article
This article deals with quasi- and pseudo-likelihood estimation in a class of continuous-time multi-type Markov branching processes observed at discrete points in time. "Conventional" and conditional estimation are discussed for both approaches. We compare their properties and identify situations where they lead to asymptotically equivalent estimators. Both approaches possess robustness properties, and coincide with maximum likelihood estimation in some cases. Quasi-likelihood functions involving only linear combinations of the data may be unable to estimate all model parameters. Remedial measures exist, including resorting either to non-linear functions of the data or to conditioning the moments on appropriate sigma-algebras. The method of pseudo-likelihood may also resolve this issue. We investigate the properties of these approaches in three examples: the pure birth process, the linear birth-and-death process, and a two-type process that generalizes the previous two examples. Simulation studies are conducted to evaluate performance in finite samples.
 
Article
We summarize, review and comment upon three papers which discuss the use of discrete, noisy, incomplete, scattered pairwise dissimilarity data in statistical model building. Convex cone optimization codes are used to embed the objects into a Euclidean space which respects the dissimilarity information while controlling the dimension of the space. A "newbie" algorithm is provided for embedding new objects into this space. This allows the dissimilarity information to be incorporated into a Smoothing Spline ANOVA penalized likelihood model, a Support Vector Machine, or any model that will admit Reproducing Kernel Hilbert Space components, for nonparametric regression, supervised learning, or semi-supervised learning. Future work and open questions are discussed. The papers are: F. Lu, S. Keles, S. Wright and G. Wahba (2005), "A framework for kernel regularization with application to protein clustering," Proceedings of the National Academy of Sciences 102, 12332-12337; G. Corrada Bravo, G. Wahba, K. Lee, B. Klein, R. Klein and S. Iyengar (2009), "Examining the relative influence of familial, genetic and environmental covariate information in flexible risk models," Proceedings of the National Academy of Sciences 106, 8128-8133; and F. Lu, Y. Lin and G. Wahba, "Robust manifold unfolding with kernel regularization," TR 1008, Department of Statistics, University of Wisconsin-Madison.
 
Article
We consider the semiparametric proportional hazards model for the cause-specific hazard function in analysis of competing risks data with missing cause of failure. The inverse probability weighted equation and augmented inverse probability weighted equation are proposed for estimating the regression parameters in the model, and their theoretical properties are established for inference. Simulation studies demonstrate that the augmented inverse probability weighted estimator is doubly robust and the proposed method is appropriate for practical use. The simulations also compare the proposed estimators with the multiple imputation estimator of Lu and Tsiatis (2001). The application of the proposed method is illustrated using data from a bone marrow transplant study.
 
Article
Failure times are often right-censored and left-truncated. In this paper we give a mass redistribution algorithm for right-censored and/or left-truncated failure time data. We show that this algorithm yields the Kaplan-Meier estimator of the survival probability. One application of this algorithm, to modeling the subdistribution hazard for competing risks data, is studied. We give a product-limit estimator of the cumulative incidence function via modeling the subdistribution hazard. We show by induction that this product-limit estimator is identical to the left-truncated version of the Aalen-Johansen (1978) estimator of the cumulative incidence function.
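The redistribution idea for the right-censored case (without truncation) can be sketched with the classical redistribute-to-the-right construction, which reproduces the Kaplan-Meier jumps; the article's algorithm additionally handles left truncation, which is not shown here.

```python
import numpy as np

def redistribute_to_the_right(time, event):
    """time: observed times; event: 1 = failure, 0 = censored."""
    time, event = np.asarray(time, float), np.asarray(event, int)
    order = np.lexsort((1 - event, time))   # ties: failures before censorings
    time, event = time[order], event[order]
    n = len(time)
    mass = np.full(n, 1.0 / n)              # each observation starts with mass 1/n
    for i in range(n):
        if event[i] == 0 and i < n - 1:     # censored: pass its mass to the right
            mass[i + 1:] += mass[i] / (n - 1 - i)
            mass[i] = 0.0
    surv = 1.0 - np.cumsum(np.where(event == 1, mass, 0.0))
    return time, event, mass, surv

t, e, m, S = redistribute_to_the_right([2, 3, 3, 5, 8, 9], [1, 0, 1, 1, 0, 1])
print(np.round(S, 3))   # Kaplan-Meier curve at the sorted times:
                        # 0.833, 0.667, 0.667, 0.444, 0.444, 0.0
```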
 
Article
Censored median regression has proved useful for analyzing survival data in complicated situations, for example when the variance is heteroscedastic or the data contain outliers. In this paper, we study sparse estimation for censored median regression models, an important problem for high-dimensional survival data analysis. In particular, a new procedure is proposed that minimizes an inverse-censoring-probability weighted least absolute deviation loss subject to the adaptive LASSO penalty, resulting in a sparse and robust median estimator. We show that, with a proper choice of the tuning parameter, the procedure can identify the underlying sparse model consistently and has desirable large-sample properties, including root-n consistency and asymptotic normality. The procedure also enjoys great advantages in computation, since its entire solution path can be obtained efficiently. Furthermore, we propose a resampling method to estimate the variance of the estimator. The performance of the procedure is illustrated by extensive simulations and two real data applications, including one to microarray gene expression survival data.
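The type of criterion described above can be written schematically as follows; the notation is illustrative rather than copied from the article, with $\hat G$ the Kaplan-Meier estimator of the censoring distribution, $\delta_i$ the event indicator, and $\tilde\beta$ an initial consistent estimator.

```latex
\[
  \min_{\beta}\;\sum_{i=1}^{n} \frac{\delta_i}{\hat G(Y_i)}\,
      \bigl|\,Y_i - X_i^{\top}\beta\,\bigr|
  \;+\; \lambda_n \sum_{j=1}^{p} \frac{|\beta_j|}{|\tilde\beta_j|}
\]
% an inverse-censoring-probability weighted least absolute deviation loss plus
% an adaptive LASSO penalty whose weights come from an initial estimate, which
% is what yields a sparse and robust median fit.
```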
 
Article
Procedures are developed for estimating the parameters of the general class of semiparametric models for recurrent events proposed by Peña and Hollander (2004). This class of models incorporates an effective age function encoding the effect of changes after each event occurrence, such as the impact of an intervention; it models the impact of accumulating event occurrences on the unit; it admits a link function through which the effects of possibly time-dependent covariates are incorporated; and it allows the incorporation of unobservable frailty components which induce dependencies among the inter-event times for each unit. The estimation procedures are semiparametric in that the baseline hazard function is nonparametrically specified. The sampling distribution properties of the estimators are examined through a simulation study, and the consequences of mis-specifying the model are analyzed. The results indicate that the flexibility of this general class of models provides a safeguard for analyzing recurrent event data, even data possibly arising from a frailtyless mechanism. The estimation procedures are applied to real data sets arising in biomedical and public health settings, as well as in reliability and engineering situations. In particular, the procedures are applied to a data set pertaining to times to recurrence of bladder cancer, and the results of the analysis are compared to those obtained using three methods of analyzing recurrent event data.
 
Article
One of the fundamental issues in analyzing microarray data is to determine which genes are expressed and which ones are not for a given group of subjects. In datasets where many genes are expressed and many are not expressed (i.e., underexpressed), a bimodal distribution for the gene expression levels often results, where one mode of the distribution represents the expressed genes and the other mode represents the underexpressed genes. To model this bimodality, we propose a new class of mixture models that utilize a random threshold value for accommodating bimodality in the gene expression distribution. Theoretical properties of the proposed model are carefully examined. We use this new model to examine the problem of differential gene expression between two groups of subjects, develop prior distributions, and derive a new criterion for determining which genes are differentially expressed between the two groups. Prior elicitation is carried out using empirical Bayes methodology in order to estimate the threshold value as well as elicit the hyperparameters for the two component mixture model. The new gene selection criterion is demonstrated via several simulations to have excellent false positive rate and false negative rate properties. A gastric cancer dataset is used to motivate and illustrate the proposed methodology.
 
A toy example: G0 (top left), G1 (top right), and the MST based on G0 (bottom).
Six simulated examples of unusual bivariate distributions; a sample of size N = 100 from each distribution.
For N = 50, the tail distribution of the p-value based on the normal approximation (solid line) and the Monte Carlo approximation (dashed line); left and right panels show the entire distribution and a zoom on the relevant range of large F values, respectively.
Article
A class of distribution-free tests is proposed for the independence of two subsets of response coordinates. The tests are based on the pairwise distances across subjects within each subset of the response. A complete graph is induced by each subset of response coordinates, with the sample points as nodes and the pairwise distances as the edge weights. The proposed test statistic depends only on the rank order of edges in these complete graphs. The response vector may be of any dimension; in particular, the number of samples may be smaller than the dimension of the response. The test statistic is shown to have a normal limiting distribution with known expectation and variance under the null hypothesis of independence. The exact distribution-free null distribution of the test statistic is given for a sample of size 14, and its Monte Carlo approximation is considered for larger sample sizes. We demonstrate in simulations that this new class of tests has good power properties for very general alternatives.
 
Article
Principal points are cluster means for theoretical distributions. A discriminant methodology based on principal points is introduced. The principal point classification method is useful in clinical trials where the goal is to distinguish and differentiate between different treatment effects. In particular, in psychiatric studies where placebo response rates can be very high, principal point classification is illustrated as a way to distinguish specific drug responders from non-specific placebo responders.
 
Article
The receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate as the threshold varies, is an important tool for evaluating biomarkers in diagnostic medicine studies. By definition, the ROC curve is monotone increasing from 0 to 1 and is invariant to any monotone transformation of the test results. It is also typically smooth when the test results from the diseased and non-diseased subjects follow continuous distributions. Most existing ROC curve estimation methods do not guarantee all of these properties. One of the exceptions is Du and Tang (2009), which applies a monotone spline regression procedure to empirical ROC estimates. However, their method does not account for the inherent correlations between empirical ROC estimates, which makes the derivation of asymptotic properties very difficult. In this paper we propose a penalized weighted least squares estimation method, which incorporates the covariance between empirical ROC estimates as a weight matrix. The resulting estimator satisfies all the aforementioned properties, and we show that it is also consistent. A resampling approach is then used to extend our method to comparisons of two or more diagnostic tests. Our simulations show a significantly improved performance over the existing method, especially for steep ROC curves. We then apply the proposed method to a cancer diagnostic study that compares several newly developed diagnostic biomarkers to a traditional one.
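Schematically, a penalized weighted least squares criterion of the kind described above can be written as follows; the notation is illustrative rather than taken from the article.

```latex
% \hat p: empirical ROC estimates on a grid of false positive rates,
% B: spline basis matrix, \hat\Sigma: estimated covariance of \hat p (the
% weight matrix), J(\theta): roughness penalty; the minimization is carried
% out subject to constraints keeping the fitted curve monotone and in [0, 1].
\[
  \hat\theta \;=\; \arg\min_{\theta}\;
      (\hat p - B\theta)^{\top}\,\hat\Sigma^{-1}\,(\hat p - B\theta)
      \;+\; \lambda\, J(\theta).
\]
```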
 
Article
When data are missing, analyzing only the completely observed records may cause bias or inefficiency. Existing approaches for handling missing data include likelihood methods, imputation and inverse probability weighting. In this paper, we propose three estimators inspired by deleting some completely observed data in the regression setting. First, we generate artificial observation indicators that are independent of the outcome given the observed data and draw inferences conditioning on these artificial observation indicators. Second, we propose a closely related weighting method. The proposed weighting method has more stable weights than those of the inverse probability weighting method (Zhao and Lipsitz, 1992). Third, we improve the efficiency of the proposed weighting estimator by subtracting the projection of the estimating function onto the nuisance tangent space. When data are missing completely at random, we show that the proposed estimators have asymptotic variances smaller than or equal to the variance of the estimator obtained from using completely observed records only. Asymptotic relative efficiency computations and simulation studies indicate that the proposed weighting estimators are more efficient than the inverse probability weighting estimators under a wide range of practical situations, especially when the missingness proportion is large.
 
Article
This paper discusses characteristics of standard conjugate priors and their induced posteriors in Bayesian inference for von Mises-Fisher distributions, using either the canonical natural exponential family or the more commonly employed polar coordinate parameterizations. We analyze when standard conjugate priors as well as posteriors are proper, and investigate the Jeffreys prior for the von Mises-Fisher family. Finally, we characterize the proper distributions in the standard conjugate family of the (matrix-valued) von Mises-Fisher distributions on Stiefel manifolds.
 
Article
A generalized self-consistency approach to maximum likelihood estimation (MLE) and model building was developed by Tsodikov (2003) and applied to a survival analysis problem. We extend the framework to obtain second-order results such as the information matrix and properties of the variance. The multinomial model motivates the paper and is used throughout as an example. Computational challenges with the multinomial likelihood motivated Baker (1994) to develop the Multinomial-Poisson (MP) transformation for a large variety of regression models with a multinomial likelihood kernel. Multinomial regression is transformed into a Poisson regression at the cost of augmenting the model parameters and restricting the problem to discrete covariates. Imposing normalization restrictions by means of Lagrange multipliers (Lang, 1996) justifies the approach. Using the self-consistency framework, we develop an alternative solution to multinomial model fitting that does not require augmenting parameters, while allowing for a Poisson likelihood and arbitrary covariate structures. Normalization restrictions are imposed by averaging over artificial "missing data" (a fake mixture). The lack of a probabilistic interpretation at the "complete-data" level makes the use of the generalized self-consistency machinery essential.
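The Multinomial-Poisson transformation referred to above can be sketched as follows for a single multinomial observation; the notation is illustrative.

```latex
% Counts y_1, ..., y_K with cell probabilities p_k = \mu_k / \sum_l \mu_l.
% Introduce a free auxiliary parameter \phi and the "working" Poisson likelihood
\[
  L_{P}(\mu,\phi) \;=\; \prod_{k=1}^{K} \exp(-\phi\,\mu_k)\,(\phi\,\mu_k)^{y_k}.
\]
% Profiling out \phi (\hat\phi = \sum_k y_k / \sum_k \mu_k) leaves, up to a
% factor free of \mu,
\[
  \max_{\phi} L_{P}(\mu,\phi) \;\propto\; \prod_{k=1}^{K}
      \Bigl(\frac{\mu_k}{\sum_{l}\mu_l}\Bigr)^{y_k},
\]
% the multinomial kernel, so multinomial regression can be fit with Poisson
% machinery at the cost of one auxiliary parameter per multinomial observation.
```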
 
Number of response categories vs. time used per twenty estimation procedures (dashed line: multinomial logit model; solid line: R − 1 Poisson models with prespecified artificial variables)  
Article
Computation in the multinomial logit mixed effects model is costly, especially when the response variable has a large number of categories, since it involves high-dimensional integration and maximization. Tsodikov and Chefo (2008) developed a stable MLE approach for problems with independent observations, based on generalized self-consistency and the quasi-EM algorithm developed in Tsodikov (2003). In this paper, we apply the idea to clustered multinomial responses to simplify the maximization step. The method transforms the complex multinomial likelihood into a Poisson-type likelihood and hence allows the estimates to be obtained by iteratively solving a set of independent low-dimensional problems. The methodology is applied to real data and studied by simulations. While maximization is simplified, numerical integration remains the dominant challenge to computational efficiency.
 
Article
This paper reviews two types of geometric methods proposed in recent years for defining statistical decision rules based on two-dimensional parameters that characterize treatment effect in a medical setting. A common example is that of making decisions, such as comparing treatments or selecting a best dose, based on both the probability of efficacy and the probability of toxicity. In most applications, the two-dimensional parameter is defined in terms of a model parameter of higher dimension, including effects of treatment and possibly covariates. Each method uses a geometric construct in the two-dimensional parameter space, based on a set of elicited parameter pairs, as a basis for defining decision rules. The first construct is a family of contours that partitions the parameter space, with the contours constructed so that all parameter pairs on a given contour are equally desirable. The partition is used to define statistical decision rules that discriminate between parameter pairs in terms of their desirabilities. The second construct is a convex two-dimensional set of desirable parameter pairs, with decisions based on posterior probabilities of this set for given combinations of treatments and covariates under a Bayesian formulation. A general framework for all of these methods is provided, and each method is illustrated by one or more applications.
 
Article
This paper considers 2×2 tables arising from case-control studies in which the binary exposure may be misclassified. We find circumstances under which the inverse matrix method provides a more efficient odds ratio estimator than the naive estimator. We provide some intuition for the findings, and also give a formula for the minimum size of a validation study needed to ensure that the variance of the odds ratio estimator from the inverse matrix method is smaller than that of the naive estimator, thereby guaranteeing an advantage for the misclassification-corrected result. As a corollary of this result, we show that correcting for misclassification does not necessarily widen the confidence intervals but, in addition to producing a consistent estimate, can also produce one that is more efficient.
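As commonly described, the inverse matrix method reclassifies the observed counts using predictive values estimated from the validation substudy; the sketch below follows that description with hypothetical counts and is not the article's derivation.

```python
import numpy as np

def inverse_matrix_or(obs_exposed, obs_unexposed, ppv, npv):
    """obs_*: (cases, controls) counts under the error-prone classification;
    ppv, npv: (cases, controls) positive/negative predictive values from the
    validation study. Returns the corrected odds ratio."""
    obs_exposed = np.asarray(obs_exposed, float)
    obs_unexposed = np.asarray(obs_unexposed, float)
    ppv, npv = np.asarray(ppv, float), np.asarray(npv, float)
    true_exposed = obs_exposed * ppv + obs_unexposed * (1 - npv)
    true_unexposed = obs_exposed * (1 - ppv) + obs_unexposed * npv
    a, b = true_exposed      # (cases, controls) truly exposed
    c, d = true_unexposed    # (cases, controls) truly unexposed
    return (a * d) / (b * c)

print(round(inverse_matrix_or([120, 80], [380, 420],
                              ppv=[0.90, 0.85], npv=[0.95, 0.97]), 2))
```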
 
Article
Many recent applications of nonparametric Bayesian inference use random partition models, i.e. probability models for clustering a set of experimental units. We review the popular basic constructions. We then focus on an interesting extension of such models. In many applications covariates are available that could be used to a priori inform the clustering. This leads to random clustering models indexed by covariates, i.e., regression models with the outcome being a partition of the experimental units. We discuss some alternative approaches that have been used in the recent literature to implement such models, with an emphasis on a recently proposed extension of product partition models. Several of the reviewed approaches were not originally intended as covariate-based random partition models, but can be used for such inference.
 
Article
The mixed-effects models with two variance components are often used to analyze longitudinal data. For these models, we compare two approaches to estimating the variance components, the analysis of variance approach and the spectral decomposition approach. We establish a necessary and sufficient condition for the two approaches to yield identical estimates, and some sufficient conditions for the superiority of one approach over the other, under the mean squared error criterion. Applications of the methods to circular models and longitudinal data are discussed. Furthermore, simulation results indicate that better estimates of variance components do not necessarily imply higher power of the tests or shorter confidence intervals.
 
Article
Spatial modeling is typically composed of a specification of a mean function and a model for the correlation structure. A common assumption on the spatial correlation is that it is isotropic, meaning that the correlation between any two observations depends only on the distance between those sites and not on their relative orientation. The assumption of isotropy is often made because it yields a simpler interpretation of correlation behavior and an easier estimation problem. The assumption, however, can have serious deleterious effects when it is not appropriate. In this paper we formulate a test of isotropy for spatial observations located according to a general class of stochastic designs. Distribution theory for our test statistic is derived, and we carry out extensive simulations which verify the efficacy of our approach. We apply our methodology to a data set on longleaf pine trees from an old-growth forest in the southern United States.
 
Article
This work focuses on the estimation of distribution functions with incomplete data, where the variable of interest Y has ignorable missingness but the covariate X is always observed. When X is high dimensional, parametric approaches to incorporating the X-information are encumbered by the risk of model misspecification, and nonparametric approaches by the curse of dimensionality. We propose a semiparametric approach, developed under a nonparametric kernel regression framework but with a parametric working index that condenses the high-dimensional X-information into a reduced dimension. This kernel dimension reduction estimator has double robustness to model misspecification and is most efficient if the working index adequately conveys the X-information about the distribution of Y. Numerical studies indicate better performance of the semiparametric estimator over its parametric and nonparametric counterparts. We apply the kernel dimension reduction estimator to an HIV study of the effect of antiretroviral therapy on HIV virologic suppression.
 
Article
This paper introduces a nonparametric approach for testing the equality of two or more survival distributions based on right-censored failure times with missing population marks for the censored observations. The standard log-rank test is not applicable here because the population membership information is not available for the right-censored individuals. We propose to use imputed population marks for the censored observations, leading to fractional at-risk sets that can be used in a two-sample censored-data log-rank test. We demonstrate with a simple example that there can be a gain in power by imputing population marks for the right-censored individuals (the proposed method) compared to simply removing them (which would also maintain the correct size). The performance of the imputed log-rank tests obtained this way is studied through simulation. We also obtain an asymptotic linear representation of our test statistic. Our testing methodology is illustrated using a real data set.
 
Article
Dose-finding in clinical studies is typically formulated as a quantile estimation problem, for which correct specification of the variance function of the outcomes is important. This is especially true for sequential studies, where the variance assumption is directly involved in the generation of the design points, and hence sensitivity analysis may not be performed after the data are collected. In this light, there is a strong reason for avoiding parametric assumptions on the variance function, although this may incur a loss of efficiency. In this article, we investigate how much information one may retrieve by making additional parametric assumptions on the variance in the context of a sequential least squares recursion. By asymptotic comparison, we demonstrate that assuming homoscedasticity achieves only a modest efficiency gain when compared to nonparametric variance estimation: when homoscedasticity in truth holds, the latter is at worst 88% as efficient as the former in the limiting case, and often achieves well over 90% efficiency in most practical situations. Extensive simulation studies concur with this observation under a wide range of scenarios.
 
Article
We propose an efficient group sequential monitoring rule for clinical trials. At each interim analysis both efficacy and futility are evaluated through a specified loss structure together with the predicted power. The proposed design is robust to a wide range of priors, and achieves the specified power with a saving of sample size compared to existing adaptive designs. A method is also proposed to obtain a reduced-bias estimator of treatment difference for the proposed design. The new approaches hold great potential for efficiently selecting a more effective treatment in comparative trials. Operating characteristics are evaluated and compared with other group sequential designs in empirical studies. An example is provided to illustrate the application of the method.
 
Article
The Wilcoxon rank-sum test and its variants are historically well-known to be very powerful nonparametric decision rules for testing no location difference between two groups given paired data versus a shift alternative. In this article, we propose a new alternative empirical likelihood (EL) ratio approach for testing the equality of marginal distributions given that sampling is from a continuous bivariate population. We show that in various shift alternative scenarios the proposed exact test is superior to the classic nonparametric procedures, which may break down completely or are frequently inferior to the density-based EL ratio test. This is particularly true in the cases where there is a non-constant shift under the alternative or the data distributions are skewed. An extensive Monte Carlo study shows that the proposed test has excellent operating characteristics. We apply the density-based EL ratio test to analyze real data from two medical studies.
 
Article
The density function is a fundamental concept in data analysis. Nonparametric methods, including kernel smoothing estimators, are available when the data are completely observed. However, in studies such as diagnostic studies following a two-stage design, the group membership of some subjects may be missing. Simply ignoring the subjects with unknown membership is valid only when the data are missing completely at random (MCAR). In this paper, we consider kernel smoothing estimators of the density functions, using inverse probability approaches to address the missing values. We illustrate the approaches with simulation studies and data from a real study in mental health.
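A minimal sketch of an inverse-probability-weighted kernel density estimate of this kind is given below; the names, the Gaussian kernel, and the coding of the membership indicator are illustrative assumptions.

```python
import numpy as np

def ipw_kde(x_grid, y, verified, member, prob_verified, bandwidth):
    """y: marker values; verified: 1 if membership was ascertained; member:
    group indicator (set to 0 when unknown); prob_verified: estimated
    P(verification | covariates), e.g. from a logistic model."""
    y = np.asarray(y, float)
    w = np.asarray(verified) * np.asarray(member) / np.asarray(prob_verified)
    w = w / w.sum()                                   # normalized IPW weights
    u = (np.asarray(x_grid, float)[:, None] - y[None, :]) / bandwidth
    kern = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian kernel
    return (kern * w[None, :]).sum(axis=1) / bandwidth

# usage (pi_hat from a fitted verification model, all inputs hypothetical):
# f_hat = ipw_kde(np.linspace(0, 10, 101), y, verified, member, pi_hat, bandwidth=0.8)
```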
 
Article
This article deals with studies that monitor occurrences of a recurrent event for $n$ subjects or experimental units. It is assumed that the $i$th unit is monitored over a random period $[0, \tau_i]$. The successive inter-event times $T_{i1}, T_{i2}, \ldots$ are assumed independent of $\tau_i$. The random number of event occurrences over the monitoring period is $K_i = \max\{k \in \{0, 1, 2, \ldots\} : T_{i1} + T_{i2} + \cdots + T_{ik} \le \tau_i\}$. The $T_{ij}$s are assumed to be i.i.d. from an unknown distribution function $F$ which belongs to a parametric family of distributions $\mathcal{C} = \{F(\cdot;\theta) : \theta \in \Theta \subset \mathbb{R}^p\}$. The $\tau_i$s are assumed to be i.i.d. from an unknown distribution function $G$. The problem of estimating $\theta$, and consequently the distribution $F$, is considered under the assumption that the $\tau_i$s are informative about the inter-event distribution; specifically, $1 - G = (1 - F)^{\beta}$ for some unknown $\beta > 0$, a generalized Koziol-Green model (cf. Koziol and Green, 1976; Chen, Hollander, and Langberg, 1982). Asymptotic properties of estimators of $\theta$, $\beta$, and $F$ are presented. Efficiencies of the estimators of $\theta$ and $F$ are ascertained relative to estimators which ignore the informative monitoring aspect. These comparisons reveal the gain in efficiency when the informative structure of the model is exploited. Concrete demonstrations are provided for $F$ exponential and two-parameter Weibull.
 
Article
The theoretical literature on quantile and distribution function estimation in infinite populations is very rich, and invariance plays an important role in these studies. This is not the case for the commonly occurring problem of estimating quantiles in finite populations. The latter is more complicated and interesting because an optimal strategy consists not only of an estimator but also of a sampling design, and the estimator may depend on the design and on the labels of sampled individuals, whereas in iid sampling, design issues and labels do not exist. We study estimation of finite population quantiles, with emphasis on estimators that are invariant under the group of monotone transformations of the data, and suitable invariant loss functions. Invariance under the finite group of permutations of the sample is also considered. We discuss nonrandomized and randomized estimators, best invariant and minimax estimators, and sampling strategies relative to different classes. Invariant loss functions and estimators in finite population sampling have a nonparametric flavor, and various natural combinatorial questions and tools arise as a result.
 
Article
Most statistical methods for microarray data analysis consider one gene at a time, and they may miss subtle changes at the single-gene level. This limitation may be overcome by considering a set of genes simultaneously, where the gene sets are derived from prior biological knowledge. We call a pathway a predefined set of genes that serve a particular cellular or physiological function. Limited work has been done in regression settings to study the effects of clinical covariates and expression levels of genes in a pathway on a continuous clinical outcome. A semiparametric regression approach for identifying pathways related to a continuous outcome was proposed by Liu et al. (2007), who demonstrated the connection between a least squares kernel machine for nonparametric pathway effects and restricted maximum likelihood (REML) for variance components. However, the asymptotic properties of such semiparametric regression approaches for identifying pathways have not been studied. In this paper, we study the asymptotic properties of the parameter estimates in the semiparametric regression and compare Liu et al.'s REML with our REML obtained from a profile likelihood. We prove that both approaches provide consistent estimators, have a root-n convergence rate under regularity conditions, and have either an asymptotically normal distribution or a mixture of normal distributions. However, the estimators based on our REML obtained from a profile likelihood have a theoretically smaller mean squared error than those of Liu et al.'s REML. A simulation study supports this theoretical result. A profile restricted likelihood ratio test is also provided for the non-standard testing problem. We apply our approach to a type II diabetes data set (Mootha et al., 2003).
 
Top-cited authors
Hidetoshi Shimodaira
  • Kyoto University
Narayanaswamy Balakrishnan
  • McMaster University
Leslie M Moore
  • Los Alamos National Laboratory
Don Ylvisaker
  • University of California, Los Angeles
Debasis Kundu
  • Indian Institute of Technology Kanpur