Article

Optimal estimation in surrogate outcome regression problems

Authors: Duan, Qin & Wang

Abstract

The authors consider doubly robust estimation of a regression parameter defined by an estimating equation in a surrogate outcome set-up. Under a correct specification of the propensity score, the proposed estimator has the smallest trace of the asymptotic covariance matrix whether the "working outcome regression model" involved is correctly specified or not, which is particularly meaningful when that model is misspecified. Simulations are conducted to examine the finite-sample performance of the proposed procedure. Data on obesity and high blood pressure are analyzed for illustration. The Canadian Journal of Statistics 38: 633–646; 2010 © 2010 Statistical Society of Canada
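The estimating-equation structure described in the abstract can be illustrated numerically. The sketch below is a minimal, self-contained simulation of a doubly robust (augmented inverse-probability-weighted) fit of a linear regression parameter with a partially missing outcome and a fully observed surrogate. The data-generating model, the logistic propensity, and all names are illustrative assumptions, not the authors' notation or their actual estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulated data: covariate X, surrogate S, outcome Y, observation flag R
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])
Y = X @ beta_true + rng.normal(size=n)
S = Y + rng.normal(scale=0.5, size=n)     # surrogate correlated with Y

# Propensity of observing Y depends on observed (X, S): missing at random
pi = 1 / (1 + np.exp(-(0.5 + 0.3 * X[:, 1] + 0.2 * S)))
R = rng.binomial(1, pi)

# 'Working' outcome regression m(X, S): fit on complete cases
Z = np.column_stack([X, S])
gamma = np.linalg.lstsq(Z[R == 1], Y[R == 1], rcond=None)[0]
m = Z @ gamma

# Doubly robust estimating equation for beta:
#   sum_i X_i [ (R_i/pi_i)(Y_i - X_i'beta) - (R_i/pi_i - 1)(m_i - X_i'beta) ] = 0
# Linear in beta, so it reduces to OLS of a pseudo-outcome on X.
Yfill = np.where(R == 1, Y, 0.0)          # unobserved Y only enters times R = 0
pseudo = (R / pi) * Yfill - (R / pi - 1.0) * m
beta_dr = np.linalg.lstsq(X, pseudo, rcond=None)[0]
print(beta_dr)                            # close to beta_true
```

With the propensity correctly specified, the printed estimate stays near beta_true even if the working regression is deliberately misspecified (e.g., dropping S from Z), which is the double-robustness property the abstract refers to.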

... Despite the appealing DR property, DR estimators can exhibit severe bias when both the working regression model and the missing data model are 'slightly' misspecified [18]. To overcome this shortcoming, Cao, Tsiatis, and Davidian [23], Duan, Qin, and Wang [24], and Tsiatis, Davidian, and Cao [25] proposed DR estimators with improved robustness. For a given parametric working regression model, Cao, Tsiatis, and Davidian [23] studied a DR estimator that chooses the parameters of the working regression model so that the resulting estimator of the population mean has minimum variance. ...
... If one is interested in estimating regression parameters rather than the population mean, the asymptotic variance derived from DR estimates is a matrix. Duan, Qin, and Wang [24] chose minimum trace as the optimality criterion, but considered only cross-sectional studies. In this paper, we extend the minimum-trace-of-the-asymptotic-covariance-matrix criterion to incomplete longitudinal data with a surrogate process. ...
... A typical method is to use the ordinary least squares (OLS) estimate with the available data, which is equivalent to solving their estimating equation (7), involving the observation-indicator matrices diag(R_ij, j = 1, …, J). Following Cao, Tsiatis, and Davidian [23] and Duan, Qin, and Wang [24], one can easily argue that the solution of equation (6), with the working-regression component replaced by the estimate produced from (7), is consistent if the missing data model is correctly specified, but not optimal when the working regression model is incorrectly specified. In the following, we develop estimating equations that are optimal in the sense of minimizing the trace of the asymptotic covariance matrix of the resulting estimator. ...
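To make the minimum-trace idea in these excerpts concrete, the following self-contained sketch estimates the sandwich covariance of a DR regression estimate as a function of the working-model coefficients and minimizes its trace numerically. The cited papers derive the optimal choice via estimating equations; direct numerical minimization here is only an illustrative stand-in, and all model forms are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
S = Y + rng.normal(scale=0.5, size=n)
pi = 1 / (1 + np.exp(-(0.5 + 0.3 * X[:, 1] + 0.2 * S)))
R = rng.binomial(1, pi)                 # Y observed only when R == 1
Z = np.column_stack([X, S])             # working-model regressors
Yfill = np.where(R == 1, Y, 0.0)

def dr_beta(gamma):
    """DR estimate of beta for given working-model coefficients gamma."""
    pseudo = (R / pi) * Yfill - (R / pi - 1.0) * (Z @ gamma)
    return np.linalg.lstsq(X, pseudo, rcond=None)[0]

def trace_avar(gamma):
    """Trace of the estimated sandwich covariance of the DR estimate."""
    pseudo = (R / pi) * Yfill - (R / pi - 1.0) * (Z @ gamma)
    psi = X * (pseudo - X @ dr_beta(gamma))[:, None]   # estimating function
    Ainv = np.linalg.inv(X.T @ X / n)
    return np.trace(Ainv @ (psi.T @ psi / n) @ Ainv.T) / n

# Compare the plain OLS-based working fit with the minimum-trace choice
gamma_ols = np.linalg.lstsq(Z[R == 1], Y[R == 1], rcond=None)[0]
opt = minimize(trace_avar, gamma_ols, method="Nelder-Mead")
print(trace_avar(gamma_ols), opt.fun)   # minimum-trace gamma does no worse
print(dr_beta(opt.x))                   # beta estimate remains doubly robust
```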
Article
Missing data are a very common problem in medical and social studies, especially when data are collected longitudinally. Utilizing the observed data effectively is challenging, and many papers on missing data problems can be found in the statistical literature. It is well known that inverse-weighted estimation is neither efficient nor robust, whereas the doubly robust (DR) method can improve both efficiency and robustness. DR estimation requires a missing data model (i.e., a model for the probability that data are observed) and a working regression model (i.e., a model for the outcome variable given covariates and surrogate variables). Because the DR estimating function has mean zero for any parameters in the working regression model when the missing data model is correctly specified, in this paper we derive a formula for the estimator of the parameters of the working regression model that yields the optimally efficient estimator of the marginal mean model (the parameters of interest) when the missing data model is correctly specified. Furthermore, the proposed method also inherits the DR property. Simulation studies demonstrate the greater efficiency of the proposed method compared with the standard DR method. A longitudinal dementia data set is used for illustration. Copyright © 2013 John Wiley & Sons, Ltd.
... The resulting DR estimate is robust to misspecification of either the conditional mean model or the missing data model, but not both. The comprehensive literature on the doubly robust estimator includes Lunceford and Davidian (2004), Carpenter, Kenward, and Vansteelandt (2006), Davidian, Tsiatis, and Leon (2005), Kang and Schafer (2007), Cao, Tsiatis and Davidian (2009), Duan, Qin and Wang (2010), Tsiatis, Davidian and Cao (2010), Van der Laan and Robins (2003), Seaman and Copas (2009), and Chen and Zhou (2011), among others. ...
Article
In statistical inference, one has to make sure that the underlying regression model is correctly specified; otherwise the resulting estimates may be biased. Model checking is an important method for detecting any departure of the assumed regression model from the true one. Missing data are a ubiquitous problem in social and medical studies. When the underlying regression model is correctly specified, recent research shows the great popularity of the doubly robust (DR) estimation method for handling missing data, because of its robustness to misspecification of either the missing data model or the conditional mean model, that is, the model for the conditional expectation of the true regression function given the observed quantities. However, little work has been devoted to goodness-of-fit testing for the DR estimation method. In this article, we propose a testing method to assess the reliability of the estimator derived from the DR estimating equation with a possibly missing response and always-observed auxiliary variables. Numerical studies demonstrate that the proposed test controls type I error well and powerfully detects departures from the model assumptions in the marginal mean model of interest. A real dementia data set is used to illustrate the method for the diagnosis of model misspecification in the problem of a missing response with an always-observed auxiliary variable for cross-sectional data.
Article
Full-text available
…exp(α0 W(t)), where λ0(t) is an unspecified baseline hazard function, W(t) = w(t, V(t)), w(·, ·) is a known function that maps (t, V(t)) to R^q, and α0 is a q × 1 unknown parameter vector. When α0 ≠ 0, drop-out is nonignorable. On account of identifiability problems, joint estimation of the mean μ0 of Y and the selection bias parameter α0 may be difficult or impossible. Therefore, we propose regarding the selection bias parameter α0 as known, rather than estimating it from the data. We then perform a sensitivity analysis to see how inference about μ0 changes as we vary α0 over a plausible range of values. We apply our approach to the analysis of ACTG 175, an AIDS clinical trial.
Article
Full-text available
The propensity score is the conditional probability of assignment to a particular treatment given a vector of observed covariates. Both large and small sample theory show that adjustment for the scalar propensity score is sufficient to remove bias due to all observed covariates. Applications include: (i) matched sampling on the univariate propensity score, which is a generalization of discriminant matching, (ii) multivariate adjustment by subclassification on the propensity score where the same subclasses are used to estimate treatment effects for all outcome variables and in all subpopulations, and (iii) visual representation of multivariate covariance adjustment by a two-dimensional plot.
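As a concrete illustration of points (i) and (ii) of this abstract, the sketch below fits a propensity score by logistic regression and estimates a treatment effect by subclassification on its quintiles. The simulated data, the Newton-Raphson fit, and the choice of five subclasses are all assumptions made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=(n, 2))                # observed covariates
X = np.column_stack([np.ones(n), x])

# Treatment assignment depends on covariates; true effect is tau = 1
ps_true = 1 / (1 + np.exp(-(0.5 * x[:, 0] - 0.5 * x[:, 1])))
t = rng.binomial(1, ps_true)
y = 1.0 * t + x[:, 0] + x[:, 1] + rng.normal(size=n)

# Fit the propensity score by logistic regression (Newton-Raphson)
b = np.zeros(X.shape[1])
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ b))
    W = p * (1 - p)
    b += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (t - p))
ps = 1 / (1 + np.exp(-X @ b))

# Subclassification: within-quintile treated/control contrasts, averaged
edges = np.quantile(ps, [0.2, 0.4, 0.6, 0.8])
strata = np.digitize(ps, edges)
effects = [y[(strata == k) & (t == 1)].mean() - y[(strata == k) & (t == 0)].mean()
           for k in range(5)]
print(np.mean(effects))                     # approx. 1.0
```

A naive difference in means (y[t == 1].mean() - y[t == 0].mean()) is confounded by the covariates; averaging the within-stratum contrasts removes that confounding, which is the point of adjusting on the scalar propensity score.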
Article
Full-text available
In applied problems it is common to specify a model for the conditional mean of a response given a set of regressors. A subset of the regressors may be missing for some study subjects either by design or happenstance. In this article we propose a new class of semiparametric estimators, based on inverse probability weighted estimating equations, that are consistent for parameter vector α0 of the conditional mean model when the data are missing at random in the sense of Rubin and the missingness probabilities are either known or can be parametrically modeled. We show that the asymptotic variance of the optimal estimator in our class attains the semiparametric variance bound for the model by first showing that our estimation problem is a special case of the general problem of parameter estimation in an arbitrary semiparametric model in which the data are missing at random and the probability of observing complete data is bounded away from 0, and then deriving a representation for the efficient score, the semiparametric variance bound, and the influence function of any regular, asymptotically linear estimator in this more general estimation problem. Because the optimal estimator depends on the unknown probability law generating the data, we propose locally and globally adaptive semiparametric efficient estimators. We compare estimators in our class with previously proposed estimators. We show that each previous estimator is asymptotically equivalent to some, usually inefficient, estimator in our class. This equivalence is a consequence of a proposition stating that every regular asymptotic linear estimator of α0 is asymptotically equivalent to some estimator in our class. We compare various estimators in a small simulation study and offer some practical recommendations.
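A reduced version of the inverse-probability-weighted estimating equations described here, for the special case of one missing covariate and a known missingness probability, might look as follows. The model forms and names are assumptions, and the paper's optimal (semiparametric-efficient) members of the class are not implemented.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4000

# Conditional mean model E[Y | X1, X2] = b0 + b1*X1 + b2*X2
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 1.0 * x1 - 1.0 * x2 + rng.normal(size=n)

# x2 is missing for some subjects; completeness depends on (x1, y),
# both always observed, so the data are missing at random
pi = 1 / (1 + np.exp(-(0.2 + 0.8 * x1 + 0.3 * y)))
R = rng.binomial(1, pi)

X = np.column_stack([np.ones(n), x1, x2])
Xc, yc = X[R == 1], y[R == 1]

# Unweighted complete-case OLS is biased here (selection depends on y)
cc = np.linalg.lstsq(Xc, yc, rcond=None)[0]

# Inverse-probability-weighted estimating equation (pi treated as known):
#   sum over complete cases of (1/pi_i) X_i (y_i - X_i'b) = 0
w = 1.0 / pi[R == 1]
ipw = np.linalg.solve(Xc.T @ (Xc * w[:, None]), Xc.T @ (w * yc))
print(cc)    # biased
print(ipw)   # approx. (1.0, 1.0, -1.0)
```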
Article
Full-text available
Considerable recent interest has focused on doubly robust estimators for a population mean response in the presence of incomplete data, which involve models for both the propensity score and the regression of outcome on covariates. The usual doubly robust estimator may yield severely biased inferences if neither of these models is correctly specified and can exhibit nonnegligible bias if the estimated propensity score is close to zero for some observations. We propose alternative doubly robust estimators that achieve comparable or improved performance relative to existing methods, even with some estimated propensity scores close to zero.
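The instability this abstract describes is easy to reproduce. The sketch below contrasts a plain IPW mean with an ad hoc truncated-weight version when some propensity scores approach zero. Weight truncation is a common stabiliser used here only for illustration; it is not the alternative estimators the paper actually proposes.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
x = rng.normal(size=n)

# Propensity of observing y gets close to zero for small x
pi = 1 / (1 + np.exp(-(-2.0 + 2.5 * x)))
R = rng.binomial(1, pi)
y = 2.0 + x + rng.normal(size=n)

# Plain IPW mean: unstable because 1/pi explodes where pi is near 0
w = R / pi
ipw = np.sum(w * y) / n

# Truncated weights: cap pi below at 0.05 (an ad hoc stabiliser; the paper
# cited above develops principled alternatives instead)
w_tr = R / np.clip(pi, 0.05, None)
ipw_tr = np.sum(w_tr * y) / np.sum(w_tr)    # Hajek-type normalisation
print(ipw, ipw_tr)
```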
Article
In the context of estimating β from the regression model P_β(Y|X), relating response Y to covariates X, suppose that only a surrogate response S is available for most study subjects. Suppose that for a random subsample of the study cohort, termed the validation sample, the true outcome Y is available in addition to S. We consider maximum likelihood estimation of β from such data and show that it is nonrobust to misspecification of the distribution relating the surrogate to the true outcome, P(S|Y, X). An alternative semi-parametric method is also considered, which is nonparametric with respect to P(S|Y, X). Large-sample distribution theory for maximum estimated likelihood estimates is developed. An illustrative example is presented.
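A toy version of the estimated-likelihood idea: with a binary outcome, a binary surrogate, and a logistic model for P(Y | X), the law of the surrogate given the true outcome is estimated nonparametrically (here, by simple frequencies) on the validation sample and plugged into the likelihood contributions of the non-validated subjects. Assuming S independent of X given Y keeps the sketch small; the paper treats the general P(S | Y, X).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n = 3000
x = rng.normal(size=n)
p_y = 1 / (1 + np.exp(-(-0.5 + 1.0 * x)))
y = rng.binomial(1, p_y)                        # true binary outcome
s = np.where(rng.random(n) < 0.8, y, 1 - y)     # noisy binary surrogate

# Validation subsample: true y observed only here
V = rng.random(n) < 0.25

# Frequency estimate of P(S = b | Y = a) from the validation sample
p_s_given_y = np.array([[np.mean(s[V & (y == a)] == b) for b in (0, 1)]
                        for a in (0, 1)])       # rows: y, cols: s

def neg_estimated_loglik(beta):
    p1 = 1 / (1 + np.exp(-(beta[0] + beta[1] * x)))    # model P(Y=1 | X)
    # Validation subjects contribute the ordinary likelihood of y given x
    ll_v = np.where(y[V] == 1, np.log(p1[V]), np.log(1 - p1[V]))
    # Non-validated subjects: sum over y of P_beta(y | x) * P_hat(s | y)
    lik_nv = ((1 - p1[~V]) * p_s_given_y[0, s[~V]]
              + p1[~V] * p_s_given_y[1, s[~V]])
    return -(ll_v.sum() + np.log(lik_nv).sum())

fit = minimize(neg_estimated_loglik, np.zeros(2), method="BFGS")
print(fit.x)                                    # approx. (-0.5, 1.0)
```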
Article
This paper presents a general technique for the treatment of samples drawn without replacement from finite universes when unequal selection probabilities are used. Two sampling schemes are discussed in connection with the problem of determining optimum selection probabilities according to the information available in a supplementary variable. Admittedly, these two schemes have limited application. They should prove useful, however, for the first stage of sampling with multi-stage designs, since both permit unbiased estimation of the sampling variance without resorting to additional assumptions. (Journal Paper No. J2139 of the Iowa Agricultural Experiment Station, Ames, Iowa, Project 1005. Presented to the Institute of Mathematical Statistics, March 17, 1951.)
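For illustration, the sketch below applies the Horvitz-Thompson unequal-probability estimator of a population total, with inclusion probabilities proportional to a supplementary variable. It uses Poisson sampling, where inclusion decisions are independent, rather than the paper's two without-replacement schemes, because the unbiased variance estimator then takes a one-line form; all numbers are simulated assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 1000
aux = rng.gamma(2.0, 1.0, size=N)               # supplementary (size) variable
y = 3.0 * aux + rng.normal(scale=0.5, size=N)   # study variable, related to aux

# Inclusion probabilities proportional to the supplementary variable
n_target = 100
pi = np.minimum(1.0, n_target * aux / aux.sum())

# Poisson sampling: each unit included independently with probability pi_i
I = rng.random(N) < pi

# Horvitz-Thompson estimator of the population total, and its unbiased
# variance estimator under Poisson sampling
t_ht = np.sum(y[I] / pi[I])
v_ht = np.sum((1 - pi[I]) / pi[I] ** 2 * y[I] ** 2)
print(y.sum(), t_ht, np.sqrt(v_ht))
```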
Article
A surrogate endpoint in a cardiovascular clinical trial is defined as an endpoint measured in lieu of some other, so-called 'true', endpoint. A surrogate is especially useful if it is easily measured and highly correlated with the true endpoint. Often the 'true' endpoint is one with clinical importance to the patient, for example, mortality or a major clinical outcome, while a surrogate is one biologically closer to the process of disease, for example, ejection fraction. Use of the surrogate can often lead to dramatic reductions in sample size and much shorter studies than use of the true endpoint. We discuss several problems common in trials with surrogate endpoints. Most important is the effect of missing data, especially in the face of informative censoring. Possible solutions are the assignment of scores or formal penalties to missing data.
Article
Investigators use a surrogate endpoint when the endpoint of interest is too difficult and/or expensive to measure routinely and when they can define some other, more readily measurable, endpoint that is sufficiently well correlated with the first to justify its use as a substitute. A surrogate endpoint is usually proposed on the basis of a biologic rationale. In cancer studies with survival time as the primary endpoint, surrogate endpoints frequently employed are tumour response, time to progression, or time to reappearance of disease, since these events occur earlier and are unaffected by use of secondary therapies. In early drug development studies, tumour response is often the true primary endpoint. We discuss the investigation of the validity of carcinoembryonic antigen (a tumour marker present in the blood) as a surrogate for tumour response. In considering the validity of surrogate endpoints, one must distinguish between study endpoints that provide a basis for reliable comparisons of therapeutic effect, and clinical endpoints that are useful for patient management but have insufficient sensitivity and/or specificity to provide reproducible assessments of the effects of particular therapies.
Article
We consider estimation for regression analysis with surrogate or auxiliary outcome data. Assume that the regression model for the conditional mean of the outcome is a known function of a linear combination of the covariates with unknown coefficients, which are the regression parameters of interest. Such a class of models includes the generalised linear models as special cases. Suppose further that the outcome variable of interest is only observed in a validation subset, which is a simple random subsample from the whole sample, and that data on covariates, as well as on one or more easily measured but less accurate surrogate outcome variables, are collected for the whole sample. We propose a robust imputation approach which replaces the unobserved value of the outcome by its 'predicted' value generated from a specified 'working' parametric model. Estimation of the regression parameters is conducted as if the outcome data were completely observed. The resulting estimator of the regression parameters is consistent even if the 'working model' is misspecified. Large- and finite-sample properties of the proposed estimator are investigated.
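A minimal sketch of this imputation strategy under assumed simple forms (identity link, linear working model fit by least squares on the validation subsample); the paper covers general link functions and working models. The robustness arises because the validation set is a simple random subsample and the covariates X are contained in the working-model regressors Z.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 3000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
s = y + rng.normal(size=n)                      # surrogate outcome

# Validation subset: a simple random subsample with the true y observed
V = rng.random(n) < 0.3

# 'Working' model: linear regression of y on (X, s), fit on validation data
Z = np.column_stack([X, s])
gamma = np.linalg.lstsq(Z[V], y[V], rcond=None)[0]
y_hat = Z @ gamma                               # predicted outcome for everyone

# Estimate beta as if y_hat were the fully observed outcome; since the
# validation set is a random subsample and X is a sub-block of Z, the
# normal equations match those of the full-data regression in expectation
beta = np.linalg.lstsq(X, y_hat, rcond=None)[0]
print(beta)                                     # approx. (1.0, 2.0)
```

Replacing s in Z with a poorly chosen predictor degrades efficiency but, by the argument in the abstract, leaves the estimate of beta consistent.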
Article
Surrogate endpoints have been defined by Prentice as response variables that can substitute for a 'true' endpoint for the purpose of comparing specific interventions or treatments in a clinical trial. The applicability of this definition, and of related surrogate endpoint criteria, is discussed, with emphasis on cancer and AIDS research settings. Auxiliary endpoints are defined as response variables, or covariates, that can strengthen true endpoint analyses. Specifically, such response variables provide some additional information on true endpoint occurrence times for study subjects having censored values for such times. Auxiliary variables will very frequently be available, and they may be able to be used without making additional strong assumptions. Approaches to the use of auxiliary variables using ideas based on augmented score and augmented likelihood methods are described.
Article
CD4 lymphocyte and survival data from two completed trials, a double-blind placebo-controlled trial of zidovudine in patients with advanced human immunodeficiency virus type 1 (HIV) disease (BW-02 study) and a randomized trial of two different doses of zidovudine in patients with advanced HIV disease (ACTG-002 study) were used to determine the degree to which CD4 lymphocyte counts reflect zidovudine-associated survival benefit. Proportional hazards models were used, and CD4 lymphocyte counts were smoothed by using empirical Bayes estimates. The geometric mean of the CD4 lymphocyte counts increased by 71 and 46 cells/mm3 for patients in the BW-02 and ACTG-002 studies, respectively, followed by a progressive decline. Higher pretreatment CD4 lymphocyte counts (p = 0.001), greater increases in CD4 lymphocytes at 8 weeks (p = 0.1), and smaller declines in the slope (p = 0.001) were associated with a lower risk of death. The most current CD4 lymphocyte count was most prognostic of death (p = 0.001). The risk of death was greater for patients with lower CD4 lymphocytes and this risk increased sharply when the CD4 lymphocyte counts fell below 50 cells/mm3. The hazard of death was higher for placebo recipients at all levels of CD4 lymphocytes compared with zidovudine recipients. Although higher CD4 lymphocyte counts are associated with improved survival, these increases account for only a small proportion of the survival benefit of zidovudine in these two studies.
Article
Phase 3 clinical trials, which evaluate the effect that new interventions have on the clinical outcomes of particular relevance to the patient (such as death, loss of vision, or other major symptomatic event), often require many participants to be followed for a long time. There has recently been great interest in using surrogate end points, such as tumor shrinkage or changes in cholesterol level, blood pressure, CD4 cell count, or other laboratory measures, to reduce the cost and duration of clinical trials. In theory, for a surrogate end point to be an effective substitute for the clinical outcome, effects of the intervention on the surrogate must reliably predict the overall effect on the clinical outcome. In practice, this requirement frequently fails. Among several explanations for this failure is the possibility that the disease process could affect the clinical outcome through several causal pathways that are not mediated through the surrogate, with the intervention's effect on these pathways differing from its effect on the surrogate. Even more likely, the intervention might also affect the clinical outcome by unintended, unanticipated, and unrecognized mechanisms of action that operate independently of the disease process. We use examples from several disease areas to illustrate how surrogate end points have been misleading about the actual effects that treatments have on the health of patients. Surrogate end points can be useful in phase 2 screening trials for identifying whether a new intervention is biologically active and for guiding decisions about whether the intervention is promising enough to justify a large definitive trial with clinically meaningful outcomes. In definitive phase 3 trials, except for rare circumstances in which the validity of the surrogate end point has already been rigorously established, the primary end point should be the true clinical outcome.
Article
The validation of surrogate endpoints has been studied by Prentice (1989, Statistics in Medicine 8, 431-440) and Freedman, Graubard, and Schatzkin (1992, Statistics in Medicine 11, 167-178). We extended their proposals in the cases where the surrogate and the final endpoints are both binary or normally distributed. Letting T and S be random variables that denote the true and surrogate endpoint, respectively, and Z be an indicator variable for treatment, Prentice's criteria are fulfilled if Z has a significant effect on T and on S, if S has a significant effect on T, and if Z has no effect on T given S. Freedman relaxed the latter criterion by estimating PE, the proportion of the effect of Z on T that is explained by S, and by requiring that the lower confidence limit of PE be larger than some proportion, say 0.5 or 0.75. This condition can only be verified if the treatment has a massively significant effect on the true endpoint, a rare situation. We argue that two other quantities must be considered in the validation of a surrogate endpoint: RE, the effect of Z on T relative to that of Z on S, and γZ, the association between S and T after adjustment for Z. A surrogate is said to be perfect at the individual level when there is a perfect association between the surrogate and the final endpoint after adjustment for treatment. A surrogate is said to be perfect at the population level if RE is 1. A perfect surrogate fulfills both conditions, in which case S and T are identical up to a deterministic transformation. Fieller's theorem is used for the estimation of PE, RE, and their respective confidence intervals. Logistic regression models and the global odds ratio model studied by Dale (1986, Biometrics, 42, 909-917) are used for binary endpoints. Linear models are employed for continuous endpoints. In order to be of practical value, the validation of surrogate endpoints is shown to require large numbers of observations.
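For continuous endpoints and linear models, the two validation quantities this abstract introduces, PE (the proportion of the treatment effect explained by the surrogate) and RE (the effect of Z on T relative to its effect on S), reduce to simple coefficient ratios, as sketched below with simulated data. The Fieller-based confidence intervals discussed in the paper are omitted, and the data-generating values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
z = rng.binomial(1, 0.5, size=n)                # randomised treatment
s = 1.0 * z + rng.normal(size=n)                # surrogate endpoint
t = 2.0 * s + 0.5 * z + rng.normal(size=n)      # true endpoint

def ols(Xmat, yvec):
    return np.linalg.lstsq(Xmat, yvec, rcond=None)[0]

one = np.ones(n)
beta = ols(np.column_stack([one, z]), t)[1]           # effect of Z on T
beta_adj = ols(np.column_stack([one, z, s]), t)[1]    # effect of Z on T given S
alpha = ols(np.column_stack([one, z]), s)[1]          # effect of Z on S

PE = 1 - beta_adj / beta     # proportion of treatment effect explained
RE = beta / alpha            # relative effect
print(PE, RE)                # approx. 0.8 and 2.5 with these settings
```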
Article
The use of surrogate end points has become increasingly common in medical and biological research. This is primarily because, in many studies, the primary end point of interest is too expensive or too difficult to obtain. There is now a large volume of statistical methods for analysing studies with surrogate end point data. However, to our knowledge, there has not been a comprehensive review of these methods to date. This paper reviews some existing methods and summarizes the strengths and weaknesses of each method. It also discusses the assumptions that are made by each method and critiques how likely these assumptions are met in practice.
Empirical Processes in M-Estimation
  • S. A. van de Geer
S. A. van de Geer (2000). Empirical Processes in M-Estimation. Cambridge University Press, New York.

Surrogate endpoints in clinical trials: Cardiovascular disease
  • J. Wittes
  • E. Lakatos
  • J. Probstfield
J. Wittes, E. Lakatos & J. Probstfield (1989). Surrogate endpoints in clinical trials: Cardiovascular disease. Statistics in Medicine, 8, 415–425.

  • Little

Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussion)
  • Scharfstein
D. O. Scharfstein, A. Rotnitzky & J. M. Robins (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussion). Journal of the American Statistical Association, 94, 1096–1120.