Stata Journal

Published by StataCorp
Print ISSN: 1536-867X
Classification accuracy is the ability of a marker or diagnostic test to discriminate between two groups of individuals, cases and controls, and is commonly summarized using the receiver operating characteristic (ROC) curve. In studies of classification accuracy, there are often covariates that should be incorporated into the ROC analysis. We describe three different ways of using covariate information. For factors that affect marker observations among controls, we present a method for covariate adjustment. For factors that affect discrimination (i.e. the ROC curve), we describe methods for modelling the ROC curve as a function of covariates. Finally, for factors that contribute to discrimination, we propose combining the marker and covariate information, and ask how much discriminatory accuracy improves with the addition of the marker to the covariates (incremental value). These methods follow naturally when representing the ROC curve as a summary of the distribution of case marker observations, standardized with respect to the control distribution.
Figure: Lowess and straight-line fit for the association between nurse-reported BMI and age of onset of smoking
Figure: Lowess and straight-line fit for the association between self-reported BMI and age of onset of smoking
Multiple-source data are often collected to provide better information about an underlying construct that is difficult to measure or likely to be missing. In this article, we describe regression-based methods for analyzing multiple-source data in Stata. We use data from the BROMS Cohort Study, a cohort of Swedish adolescents for whom body mass index was both self-reported and measured by nurses. We draw both source reports together into a single frame of reference and relate them to smoking onset. This unified method has two advantages over traditional approaches: 1) the relative predictiveness of each source can be assessed, and 2) all subjects contribute to the analysis. The methods are applicable to other areas of epidemiology where multiple-source reports are used.
Clustered data arise in many settings, particularly within the social and biomedical sciences. As an example, multiple-source reports are commonly collected in child and adolescent psychiatric epidemiologic studies, where researchers use various informants (e.g. parent and adolescent) to provide a holistic view of a subject's symptomatology. Fitzmaurice et al. (1995) described estimation of multiple-source models in a standard generalized estimating equation (GEE) framework. However, these studies often have missing data because of the additional stages of consent and assent required. The usual GEE is unbiased when missingness is missing completely at random (MCAR) in the sense of Little and Rubin (2002), a strong assumption that may not be tenable. Other options, such as weighted generalized estimating equations (WGEEs), are computationally challenging when missingness is nonmonotone. Multiple imputation is an attractive method for fitting incomplete-data models while requiring only the less restrictive missing at random (MAR) assumption. Previously, estimation with partially observed clustered data was computationally challenging; however, recent developments in Stata have facilitated its use in practice. We demonstrate how to use multiple imputation in conjunction with a GEE to investigate the prevalence of disordered eating symptoms in adolescents, as reported by parents and adolescents, as well as factors associated with concordance and prevalence. The methods are motivated by the Avon Longitudinal Study of Parents and Children (ALSPAC), a cohort study that enrolled more than 14,000 pregnant mothers in 1991-92 and has followed the health and development of their children at regular intervals. While point estimates were fairly similar to those from the GEE under MCAR, the MAR model had smaller standard errors while requiring less stringent assumptions about missingness.
Table: Description of variables in simulated data
Figure: Lengths of follow-up for individuals in longitudinal clinical data (gray indicates the time point is included in the individual's follow-up)
Electronic health records of longitudinal clinical data are a valuable resource for health care research. One obstacle to using databases of health records in epidemiological analyses is that general practitioners mainly record data that are clinically relevant. If we treat the unavailability of measurements as a missing-data problem, we can use existing methods to handle missing data, such as multiple imputation (MI). However, most software implementations of MI do not account for the longitudinal and dynamic structure of the data and are difficult to apply to large databases with millions of individuals and long follow-up. Nevalainen, Kenward, and Virtanen (2009, Statistics in Medicine 28: 3657–3669) proposed the two-fold fully conditional specification algorithm to impute missing data in longitudinal data: it imputes missing values at a given time point conditional on information at the same time point and at immediately adjacent time points. In this article, we describe a new command, twofold, that implements the two-fold fully conditional specification algorithm, extended to accommodate MI of longitudinal clinical records in large databases.
This article considers the situation that arises when a survey data producer has collected data from a sample with a complex design (possibly featuring stratification of the population, cluster sampling, and/or unequal probabilities of selection) and, for various reasons, provides secondary analysts of those survey data with only a final survey weight for each respondent and "average" design effects for survey estimates computed from the data. In general, these "average" design effects, presumably computed by the data producer in a way that fully accounts for all of the complex sampling features, already incorporate possible increases in sampling variance due to the use of the survey weights in estimation. A secondary analyst who then 1) uses the provided information to compute weighted estimates, 2) computes design-based standard errors reflecting variance in the weights (using Taylor series linearization, for example), and 3) inflates the estimated variances using the "average" design effects provided is applying a "double" adjustment to the standard errors for the effect of weighting on the variance estimates, leading to overly conservative inferences. We propose a simple method for preventing this problem and provide a Stata program for applying appropriate adjustments to variance estimates in this situation. We illustrate two applications of the method to survey data from the Monitoring the Future (MTF) study and conclude with suggested directions for future research in this area.
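One way to see the double counting is through Kish's approximation for the weighting component of the design effect. The sketch below is illustrative only and uses that standard approximation; the article's exact adjustment may differ:

```latex
\mathrm{deff}_{w} \approx 1 + cv(w)^{2}
\qquad\text{(weighting component, Kish)}
\]
\[
\widehat{SE}_{\text{naive}}
  = \widehat{SE}_{w}\,\sqrt{\overline{\mathrm{deff}}}
\qquad\text{counts } \mathrm{deff}_{w} \text{ twice, since }
\widehat{SE}_{w} \text{ already reflects the weights}
\]
\[
\widehat{SE}_{\text{adj}}
  = \widehat{SE}_{w}\,\sqrt{\overline{\mathrm{deff}}/\mathrm{deff}_{w}}
```

Here cv(w) is the coefficient of variation of the survey weights and the overline denotes the producer-supplied "average" design effect.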
The Skillings–Mack statistic (Skillings and Mack, 1981, Technometrics 23: 171–177) is a general Friedman-type statistic that can be used in almost any block design with an arbitrary missing-data structure. The missing data can be either missing by design, for example, an incomplete block design, or missing completely at random. The Skillings–Mack test is equivalent to the Friedman test when there are no missing data in a balanced complete block design, and the Skillings–Mack test is equivalent to the test suggested in Durbin (1951, British Journal of Psychology, Statistical Section 4: 85–90) for a balanced incomplete block design. The Friedman test was implemented in Stata by Goldstein (1991, Stata Technical Bulletin 3: 26–27) and further developed in Goldstein (2005, Stata Journal 5: 285). This article introduces the skilmack command, which performs the Skillings–Mack test. The skilmack command is also useful when there are many ties or equal ranks (N.B. the Friedman statistic compared with the χ² distribution will give a conservative result), as well as for small samples; appropriate results can be obtained by simulating the distribution of the test statistic under the null hypothesis.
This article discusses a method by Erikson et al. (2005) for decomposing a total effect in a logit model into direct and indirect effects. Moreover, this article extends that method in three ways. First, in the original method the variable through which the indirect effect occurs is assumed to be normally distributed; here the method is generalized by allowing this variable to have any distribution. Second, the original method did not provide standard errors for the estimates; the bootstrap is proposed as a method for obtaining them. Third, I show how to include control variables in the decomposition, which the original method did not allow. The original method and these extensions are implemented in the ldecomp package.
This paper describes how to write Stata programs to estimate the power of virtually any statistical test that Stata can perform. Examples given include the t test, Poisson regression, Cox regression, and the nonparametric rank-sum test. Copyright 2002 by Stata Corporation.
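The general approach can be sketched in a few lines: simulate data under the alternative hypothesis, run the test, record whether it rejects, and average over replications. The program name simtt, the effect size of 0.3, the sample size of 100, and the number of replications are all illustrative assumptions, not the article's own program:

```stata
program define simtt, rclass
    drop _all
    set obs 100
    generate y = rnormal(0.3, 1)        // simulate under the alternative
    ttest y == 0                        // one-sample t test of H0: mean 0
    return scalar reject = (r(p) < 0.05)
end

simulate reject = r(reject), reps(1000): simtt
summarize reject                        // mean of reject = estimated power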
This article introduces a new Stata command, labcenswdi, to automatically manage databases that provide variable descriptions on the second row in a dataset. While renaming all variables and converting them from string to numeric, labcenswdi automatically manages the variable descriptions including removing them from the second row to place them into Stata variable labels and saving them to a text file. The process yields a dataset ready for statistical analysis. I illustrate how this command can be used to efficiently manage datasets obtained from the U.S. Census 2000 and the World Development Indicators databases. Copyright 2011 by StataCorp LP.
The uniform() function generates random draws from a uniform distribution between zero and one ([D] functions). One of its many uses is creating random draws from a discrete distribution where each possible value has a known probability. A uniform distribution means that each number between zero and one is equally likely to be drawn, so the probability that a random draw from a uniform distribution has a value less than .50 is 50%, the probability that such a draw has a value less than .60 is 60%, and so on. The example below shows how this can be used to create a random variable where the probability of drawing a 1 is 60% and the probability of drawing a 0 is 40%. In the first line, random draws from the uniform distribution are stored in the variable rand. Each case has a 60% probability of getting a value of rand that is less than .60 and a 40% probability of a value of at least .60. The second line uses this fact to create draws from the desired distribution: using the cond() function (Kantor and Cox 2005), it creates a new variable, draw, which takes the value 1 if rand is less than .60 and 0 otherwise.
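The two lines described above can be reconstructed as follows (a minimal sketch; the variable names rand and draw follow the text, while the set obs line is an assumed setup step):

```stata
. set obs 1000                             // illustrative number of observations
. generate rand = uniform()                // uniform draws on (0,1)
. generate draw = cond(rand < .60, 1, 0)   // 1 with probability .60, 0 otherwise
```

Tabulating draw should show roughly 60% ones and 40% zeros.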
Top-cited authors
Patrick Royston
  • University College London
David Roodman
  • Bill & Melinda Gates Foundation
Christopher Baum
  • Boston College, Chestnut Hill MA, USA
Mark E. Schaffer
  • Heriot-Watt University