Article

Eelworms, Bullet Holes, and Geraldine Ferraro: Some Problems With Statistical Adjustment and Some Solutions

Authors:
  • Independent Statistician and Author

Abstract

There is no safety in numbers. When data are gathered from a sample in which the selection criteria are unknown, many problems can befall the unwary investigator. In this paper we explore some of these problems and discuss some solutions. Our principal example is drawn from data from students who choose to take the College Board's Scholastic Aptitude Test (SAT). We explore methods of covariance adjustment as well as more explicitly model-based adjustment methods. Among the latter we discuss Heckman's Selection Model, Rubin's Mixture Model, and Tukey's Simplified Selection Model.


... Wainer (1989) cited Tukey's comment that statisticians are like lawyers (Wainer, 1986). Some lawyers tell you, "Don't do it!" ...
... The purpose of this article is to argue that scientists should examine any statistical analysis in the context of a set of rival theories. Our support of modeling is in partial agreement with Wainer's (1989) paper; however, we want to push modeling a step further and emphasize that theories often warn us to side with the type of lawyer who explains why certain actions are improper and should not be done. Simple theories help us understand why better schools might produce lower average Scholastic Aptitude Test (SAT) scores, why a pesticide might be a good idea even if crop yields are equal in treated and untreated fields when the number of eelworms is partialed out, and why, in the absence of bias, men might have higher salaries than women with the same merit and have higher merit than women who receive the same salaries. ...
... However, regression analysis has the property that the lower the correlation between merit and salary, the easier it is to falsely conclude that there is sex bias when b = 0, if regression adjustments are used. Wainer (1989) reviewed the eelworm study as a classic example of misuse of statistical adjustment. The mediated model of eelworms (substituting measured eelworms for measured merit, fumigation for sex, and crop yield for salary) shows why the effect of treatment with eelworms partialed out would not be an estimate of the treatment effect. ...
Article
Full-text available
This paper addresses the interpretation of data that are contaminated by self-selected samples and/or lack of experimental control. Wainer (1989) reviewed different methods for treating self-selected samples and concluded that the most defensible approach is to model the process that caused the data to be missing. In concert with Wainer, the present paper emphasizes the value of specifying models; however, the theses of the present paper are that (a) any analysis should be interpreted in the context of a set of rival theoretical models, (b) these models should allow latent mediating variables that are imperfectly measured by observed variables, and (c) modeling brings clarity to the conclusion that confounded data can be misleading. Simple models clarify the limitations on conclusions one might otherwise attempt to draw from tainted data.
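The reversal described in the excerpts above (spurious sex bias appearing under regression adjustment precisely when merit is measured imperfectly) is easy to reproduce by simulation. A minimal sketch, with hypothetical parameter values rather than the authors' actual model, assuming only numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Latent merit: the male group mean is higher (hypothetical values).
sex = rng.integers(0, 2, n)                # 1 = male, 0 = female
merit = rng.normal(0.5 * sex, 1.0)

# Salary is set by latent merit alone: the bias parameter b = 0.
salary = merit + rng.normal(0, 0.5, n)

# Observed merit is an unreliable proxy, so the merit-salary correlation is low.
obs_merit = merit + rng.normal(0, 1.0, n)

# OLS of salary on observed merit and sex.
X = np.column_stack([np.ones(n), obs_merit, sex])
coef, *_ = np.linalg.lstsq(X, salary, rcond=None)
print(f"sex coefficient: {coef[2]:.3f}")   # clearly positive despite b = 0
```

Because obs_merit controls for latent merit only partially, part of the group difference in merit leaks into the sex term; the noisier the merit measure, the larger the spurious "bias."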
... The basic philosophy of the proposed method can best be illustrated through the classical example due to Cochran (Wainer, 1989). Consider an experiment in which soil fumigants, X, are used to increase oat crop yields, Y, by controlling the eelworm population, Z, but may also have direct effects, both beneficial and adverse, on yields besides the control of eelworms. ...
... (vi) In Figs 6(e), (f) and (g), the identifiability of P(y | x̂) is rendered feasible through observed covariates, Z, that are affected by the treatment X, that is, descendants of X. This stands contrary to the warning, repeated in most of the literature on statistical experimentation, to refrain from adjusting for concomitant observations that are affected by the treatment (Cox, 1958, p. 48; Rosenbaum, 1984; Pratt & Schlaifer, 1988; Wainer, 1989). It is commonly believed that, if a concomitant Z is affected by the treatment, then it must be excluded from the analysis of the total effect of the treatment (Pratt & Schlaifer, 1988). ...
Article
SUMMARY The primary aim of this paper is to show how graphical models can be used as a mathematical language for integrating statistical and subject-matter information. In particular, the paper develops a principled, nonparametric framework for causal inference, in which diagrams are queried to determine if the assumptions available are sufficient for identifying causal effects from nonexperimental data. If so the diagrams can be queried to produce mathematical expressions for causal effects in terms of observed distributions; otherwise, the diagrams can be queried to suggest additional observations or auxiliary experiments from which the desired inferences can be obtained.
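When a covariate set Z satisfies the framework's back-door criterion relative to treatment X and outcome Y, querying the diagram yields the familiar adjustment formula (stated here in the paper's hat notation for interventions; a standard identity of the approach, not a new result):

```latex
P(y \mid \hat{x}) \;=\; \sum_{z} P(y \mid x, z)\, P(z)
```

The point of the graphical machinery is to decide, from the diagram alone, when such a Z exists and when it does not.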
... This randomization provides control for confounding bias (Pearl 2000, Ch. 6), which appears due to the existence of unobserved confounders (UCs) generating uncontrolled variations to the treatment and outcome. Randomization of the treatment allocation constitutes one of the pillars of modern experimental design (Fisher 1951; Wainer 1989) and, more broadly, of the scientific method itself. ...
Article
Full-text available
Randomized clinical trials (RCTs) like those conducted by the FDA provide medical practitioners with average effects of treatments, and are generally more desirable than observational studies due to their control of unobserved confounders (UCs), viz., latent factors that influence both treatment and recovery. However, recent results from causal inference have shown that randomization results in a subsequent loss of information about the UCs, which may impede treatment efficacy if left uncontrolled in practice (Bareinboim, Forney, and Pearl 2015). Our paper presents a novel experimental design that can be noninvasively layered atop past and future RCTs to not only expose the presence of UCs in a system, but also reveal patient- and practitioner-specific treatment effects in order to improve decision-making. Applications are given to personalized medicine, second opinions in diagnosis, and employing offline results in online recommender systems.
... And in past examples, such as attempts to statistically adjust states' average ACT or SAT scores to correct for differing participation rates, success has been limited. In fact, competing adjustment methods based on seemingly sensible assumptions have been found to produce very different rankings of the 50 states (see Wainer, 1989). ...
Article
Political perspectives on the advisability of state achievement comparisons have changed substantially over the last half-century. As Selden (2004) noted regarding the National Assessment of Educational Progress (NAEP), “[w]hen it began in the 1960s, NAEP specifically was designed not to provide comparisons among states. This was done for political reasons, so that the assessment program would be palatable to the educational community and to the states” (p. 195). During the No Child Left Behind Era, state comparisons were complicated by the use of tests and proficiency standards that differed across states. Because of today's Common Core Standards Initiative, state comparisons are once again at the forefront of educational policy discussions. The initiative brings with it both significant opportunities and substantial psychometric challenges.
... Special statistical approaches are needed when there are nonrandomly missing data because the assumptions underlying most statistical methods are violated. Wainer (1989) discussed the problems, some solutions, and the legitimate inferences allowed when using statistical adjustments to correct for nonresponse, particularly when the nonresponse is not completely random (i.e., nonignorable). Another limitation of analyses to detect the presence of attrition bias is that comparisons are often made between participants with complete interview data and those missing one or more follow-up interviews. ...
Article
Participant attrition poses a significant threat to the internal and external validity of panel studies, in part because participants who successfully complete all follow-up measurements often differ in significant ways from those respondents lost to attrition. The only certain safeguard against potential biases resulting from attrition is to ensure high interview completion rates during follow-up. Unfortunately, information about reducing preventable attrition is not discussed in most research reports and a comprehensive review paper has not yet been published. The purpose of the present paper is to provide a brief overview of how attrition can threaten the validity of panel studies and to discuss eight promising methods of minimizing attrition through the use of effective retention and tracking strategies. Attempts to reduce attrition are not always met with complete success; therefore, a brief discussion of statistical techniques to assess and correct for potential attrition biases is provided. Finally, methods of calculating attrition rates are suggested along with recommendations for future research.
Article
Although safety within operational systems depends on compliance with regulations, non-compliance is common in many settings. Trucking is a meaningful industry for studying operational safety compliance given that the industry is large and important, truck accidents kill thousands annually, and such accidents collectively cost the U.S. economy billions of dollars. Although the truck driving occupation is dominated by men, significant efforts are underway to recruit more women into the profession. If women are safer behind the wheel than men, increasing their ranks could improve overall safety compliance. Building on theory and evidence suggesting that men have a greater willingness to take risky actions and break rules, we used data on 22 million truck inspections from 2010 to 2022 to identify an operational safety compliance gap between men and women truckers. Overall, men were 7.4% more likely to be cited for a major violation of rules governing working hours (known as hours-of-service or HOS rules) and 13.2% more likely to have a major unsafe driving violation. We then examined whether the gap changes based on carrier size and type. We found that the HOS compliance gap is smaller for small carriers (vs. large) and private carriers (vs. for-hire), but not the unsafe driving gap. Finally, we tested whether the introduction of an intervention—electronic logging devices (ELDs) that automatically record truckers’ driving hours—closes the gap by increasing men's compliance. In line with predictions, differences between men and women disappeared after the mandate; but again, only for HOS compliance. Surprisingly, women had significantly more HOS violations in 2021 and 2022 than men—an outcome that may be tied to women truckers’ personal safety issues. In summary, the results and additional robustness checks indicate that men committed more unsafe driving violations (e.g., speeding) than women across the entire study period, while the pattern of HOS violations varied based on external events. We conclude by highlighting possible pathways for reducing the number of collisions involving trucks and thus lowering the number of fatalities and extent of economic losses.
Presentation
What are the problems of the Luxembourg PISA study, and how can they be identified and solved with modern statistics? 1. Identification of the level-2 units: schools or educational tracks within schools. 2. Selection of the secondary school type by the director of the primary school and the school supervisor. 3. Identification of the selection criteria and their statistical control. 4. Development of a multilevel model incorporating the social and ethnic segregation between schools and the effects of the students' exogenous variables within schools, using plausible values for the endogenous variable.
Conference Paper
Equal opportunities in the educational system form the basis of a modern society. This study analyses whether students in Luxembourg have the same chance of access to higher secondary education independently of the socio-economic status (SES) of their parents. First, an explorative view of the between- and within-school variation in students' attainment in mathematics is presented, showing a strong contextual effect of the educational track within school. Second, a selection model for the Luxembourg Educational Councils' choice of the secondary school type is presented and tested with the Luxembourg PISA 2003 data set published by the OECD. It shows that SES and primary school failures have a strong effect on the choice of school types. Third, a multilevel model is presented which controls for this selection process. It predicts students' achievement in mathematics from exogenous variables at the student and school levels. It demonstrates that, besides the school types, SES has a statistically significant effect on attainment in mathematics within schools. The selection process allocating students to different school types also produces an indirect effect of SES between schools. In contrast to the OECD (2006) findings, the quality-of-teaching indicators do not have a significant effect on achievement in mathematics. These strong effects of SES challenge equal opportunities in the Luxembourg educational system.
Article
Full-text available
In this article, we extend the methodology of the Cut-Score Operating Function that we introduced previously and apply it to a testing scenario with multiple independent components and different testing policies. We derive analytically the overall classification error rate for a test battery under the policy when several retakes are allowed for individual components and also for when one is required to retake the whole battery. We derive the overall classification error rate using a flexible cost function defined by weights assigned to false negative and false positive errors. The result, shown graphically, is that competing demands of minimizing both false positive and false negative errors yield a unique optimal value for the cut-score. This cut-score can be estimated numerically for any number of components and any number of retakes. Among the results we obtain is that the more lenient the retake policy the higher one must set the cut-score to minimize the error rate.
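A minimal numerical sketch of this kind of cut-score search, with hypothetical normal score distributions and cost weights rather than the authors' actual Cut-Score Operating Function, assuming numpy and scipy:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical score distributions (not taken from the article).
mu0, sd0 = 60.0, 8.0      # true nonmasters
mu1, sd1 = 75.0, 8.0      # true masters
p_master = 0.6            # base rate of masters
w_fn, w_fp = 2.0, 1.0     # cost weights: false negatives twice as costly

cuts = np.linspace(40, 95, 1101)
fn = p_master * norm.cdf(cuts, mu1, sd1)        # masters falling below the cut
fp = (1 - p_master) * norm.sf(cuts, mu0, sd0)   # nonmasters at or above it
cost = w_fn * fn + w_fp * fp                    # weighted classification error

print(f"optimal cut-score ≈ {cuts[np.argmin(cost)]:.1f}")
```

Raising w_fn pushes the optimal cut down; modeling retakes would modify fn and fp before the same minimization, which is the situation the article analyzes.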
Article
Full-text available
In 1936, The Literary Digest poll made a disastrous forecast: President Roosevelt would lose the election. George H. Gallup, one of the founding fathers of modern polling, believed the magazine could have avoided this outcome. The only thing the Digest had to do, he said, was to perform a “simple statistical correction” on the data. But Gallup was speaking from the point of view of an occupational creed foreign to the journalistic standards that informed the straw poll journalism practiced by The Literary Digest and other news publications in those days. This paper argues that new journalistic norms (e.g. “impartiality”) were the principal obstacle in the dissemination to the sphere of straw poll journalism of an emerging statistical technology, whose purpose was to evaluate and correct the raw data obtained by polls. The research shows that news-workers of that era did not view “statistical correction” as a legitimate journalistic practice. As a result, polling became, for many years thereafter, the specialty of experts outside the field of journalism.
Article
Estimates of program effects from a compensatory preschool intervention were investigated using several contemporary techniques of bias reduction. These included econometric simultaneous modeling, latent-variable structural modeling, and ordinary least squares regression. These techniques were applied to longitudinal data from Chicago's Child Parent Center preschool program, a Head Start-type intervention. Analyses of 806 Black children followed from kindergarten to Grade 6 indicated that the techniques, by and large, produced similar estimates of program impact on school achievement test scores. Effect sizes ranged from .43 to .67 at school entry to .24 to .30 at Grade 6 (7 years postprogram). These findings were robust across different model specifications and similar to those of randomized studies. The only differences among methods occurred at Grade 3. Findings suggest that different quasi-experimental estimation methods that attempt to control for selection bias can generate similar estimates of the effects of an educational treatment.
Article
Quantitative phenomena can be displayed effectively in a variety of ways, but to do so requires an understanding of both the structure of the phenomena and the limitations of candidate display formats. This paper:
• Recounts three historic instances of the vital role data displays played in important discoveries,
• Provides three levels of information which form the basis of a theory of display to help us better measure both display quality and human graphicacy, and
• Describes three steps to improve the quality of tabular presentation.
Article
This report describes two horses that were diagnosed with equine protozoal myeloencephalitis (EPM) and were treated with nitazoxanide. The horses were treated once daily with 20 mg/kg nitazoxanide orally for either 28 or 42 days. Nitazoxanide improved or eliminated the clinical signs of EPM in these two horses. Neither horse relapsed following treatment during the follow-up periods noted.
Article
An examination of the library and information science (LIS) literature reveals that surveys published from 1996 through 2001 in three major LIS journals have an average response rate of 63%, and almost three fourths of the surveys have a response rate less than 75% (the level that is widely held to be required for generalizability). Consistent with the practice in other disciplines, however, most LIS researchers do not address the issue of nonresponse beyond reporting the survey response rate. This article describes a strategy that LIS researchers can use to deal with the problem of nonresponse. As a first step, they should use methodological strategies to minimize nonresponse. To address nonresponse that remains despite the use of these strategies, researchers should use one of the following strategies: careful justification of a decision simply to interpret survey results despite nonresponse, limiting survey conclusions in recognition of potential bias from nonresponse, or assessing and correcting for bias from nonresponse.
Article
In the smoking-cessation study, these steps were carried out in a somewhat different order, a discussion of which is needed on account of some confusion that ensued in a presentation by the first author of results from this project. The presentation was given to a group of public health graduate students, with the goal of underscoring the importance of considering the theoretical possibilities that motivate the use of ANCOVA. But in the actual smoking-cessation application, the process unfolded by first identifying 15 candidate predictor variables that could conceivably be associated with abstinence, then running a stepwise logistic regression analysis with the intervention indicator always included and an entry P value criterion of P < 0.10 for other predictors, and then evaluating the extent of imbalance on predictor variables that entered. The fitted model reported in the article, with six predictors in addition to the intervention indicator, emerged from the stepwise procedure, but it was only across the levels of the “Stage of Change” variable that there were significant imbalances across treatment arms. Almost all questions raised by the graduate students concerned the use of stepwise logistic regression, as the students had been warned of the perils of stepwise procedures and had been advised against their use.
Article
Longitudinal research designs with many waves of data have the potential to provide a fine-grained description of program impact, so they should be of special value for evaluation research. This potential has been elusive because our principal analysis methods are poorly suited to the task. We present strategies for analyzing these designs using hierarchical linear modeling (HLM). The basic growth curve models found in most longitudinal applications of HLM are not well suited to program evaluation, so we develop more appropriate alternatives. Our approach defines well-focused parameters that yield meaningful effect-size estimates and significance tests, efficiently combining all waves of data available for each subject. These methods do not require a uniform set of observations from all respondents. The Boys Town Follow-Up Study, an exceptionally rich but complex data set, is used to illustrate our approach.
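The baseline two-level growth model from which such analyses start can be written in standard HLM notation (this is the generic model, not the authors' extended alternatives):

```latex
\begin{aligned}
\text{Level 1:}\quad & Y_{ti} = \pi_{0i} + \pi_{1i}\,t + e_{ti}, \qquad e_{ti} \sim N(0, \sigma^{2}),\\
\text{Level 2:}\quad & \pi_{0i} = \beta_{00} + \beta_{01} W_i + u_{0i},\\
                     & \pi_{1i} = \beta_{10} + \beta_{11} W_i + u_{1i},
\end{aligned}
```

where t indexes waves, i indexes subjects, and W_i is a program indicator; \beta_{11}, the program's effect on individual growth rates, is the kind of well-focused parameter the authors advocate. Subjects may contribute different numbers of waves without unbalancing the model.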
Article
Full-text available
The competitiveness of a country depends on the ability to tap the full potential of its talents. Therefore, equal opportunities in its educational system are an important condition for this process. This study analyzes whether students in Luxembourg have the same chance of access to higher secondary education independently of the socio-economic status (SES) of their parents. First, an explorative view of the between- and within-school variation in students' attainment in mathematics is presented, showing that a strong contextual effect of the schools exists. Second, a selection model for the Luxembourg Educational Authorities' choice of the secondary school type is presented and tested with the Luxembourg PISA 2003 data set published by the OECD. It shows that SES and primary school failure have a strong effect on the choice of grammar schools. Third, a multilevel model is presented which controls for this selection process. It predicts students' achievement in mathematics from exogenous variables at the student and school levels. It demonstrates that, besides the school types, SES has a statistically significant effect on attainment in mathematics between and within schools. In contrast to the OECD (2006) findings, the quality-of-teaching indicators do not have a significant effect on achievement in mathematics. These strong effects of SES challenge equal opportunities in the Luxembourg educational system. Finally, some improvements of the research design are proposed for further research.
Article
In this article, we examine two attempts to adjust state mean Scholastic Aptitude Test (SAT) scores for differential participation rates. We show that both attempts can be rejected because of overly stringent identifying conditions in the estimation equations, as well as because within-state SAT score distributions indicate that the selection model employed is too restrictive. We suggest that attempts to do such adjustments ought to follow five simple rules. Adherence to these rules will foster follow-up checks on untested assumptions. We also suggest that soon-to-be-available National Assessment of Educational Progress (NAEP) state data are a better way to make state comparisons.
Article
A generation ago, Fred N. Kerlinger proposed that there were a number of myths that pervaded educational research. An overview of 3 specific myths—the methods, practicality, and statistics myths—is provided, followed by a discussion about the degree to which these myths have been overcome or still exist in educational research.
Article
Full-text available
Examines the importance of robust statistics in psychological research and suggests that the classical estimates of means, variances, and correlations are sensitive to small departures from the normal curve. Statisticians have urged caution in the use of classical statistics and have proposed a variety of alternatives that are robust with respect to departures from normality. Common sources of nonnormality in psychological data are examined, along with data cleaning and robust estimation. Robust estimation using M-estimators is discussed. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
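One concrete instance of the robust alternatives discussed here is the Huber M-estimator of location, computable by iteratively reweighted means. An illustrative sketch (the function name is ours; k = 1.345 is a conventional tuning constant, not a value from the article):

```python
import numpy as np

def huber_location(x, k=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimate of location via iteratively reweighted means."""
    x = np.asarray(x, dtype=float)
    mu = np.median(x)                        # robust starting value
    s = np.median(np.abs(x - mu)) / 0.6745   # MAD estimate of scale
    for _ in range(max_iter):
        r = (x - mu) / s
        w = np.minimum(1.0, k / np.maximum(np.abs(r), 1e-12))  # Huber weights
        mu_new = np.sum(w * x) / np.sum(w)
        if abs(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu

data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 45.0])  # one gross outlier
print(huber_location(data))   # stays near 10; the mean is pulled to ~15.8
```

Observations within k scale units of the current estimate get full weight; more extreme ones are downweighted in proportion to their distance, which is exactly the insensitivity to small departures from normality that the article recommends.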
Article
Full-text available
Nonsampling errors are subtle, and strategies for dealing with them are not particularly well known within psychology. This article provides a compelling example of an incorrect conclusion drawn from a nonrandom sample: H. C. Lombard's (1835) mortality data. This example is augmented by a second example (A. Wald, 1980) that shows how modeling the selection mechanism can correct for the bias introduced by nonsampling errors. These 2 examples are then connected to modern statistical methods that through the method of multiple imputation allow researchers to assess uncertainty in observational studies. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Chapter
This chapter reviews various methodological innovations in life-span research that have come as a direct result of advances in dealing with incomplete data using structural equation models (SEMs). As with most topics in methodology, the newest approaches are not radically new at all, and much of what we present here is based on classical considerations from the analysis of variance (ANOVA). The broad methodological topics include statistical power, multivariate scale and item measurement, and longitudinal and dynamic measurements. This presentation emphasizes the use of available longitudinal data to examine the statistical approximations that would result if less data were actually available (a technique used by Bell, 1954). This same technique is used to deal with the approximations resulting from having fewer people, fewer scales, fewer items, fewer occasions, and less dynamic information. Recent examples from the Berkeley-Bradway longitudinal data are presented to illustrate the life-span data we really do need. Keywords: structural equation modeling; incomplete data; life-span data; convergence; acceleration; ANOVA designs; adaptive testing; selection effects; dynamic effects
Article
In this paper we develop pseudo-likelihood methods for the estimation of parameters in a model that is specified in terms of both selection modelling and pattern-mixture modelling quantities. Two cases are considered: (1) the model is specified directly from a joint model for the measurement and dropout processes; (2) conditional models for the measurement process given dropout and vice versa are specified directly. In the latter case, compatibility constraints to ensure the existence of a joint density are derived. The method is applied to data from a psychiatric study, where a bivariate therapeutic outcome is supplemented with covariate information.
Article
History teaches the continuity of science; the developments of tomorrow have their genesis in the problems of today. Thus any attempt to look forward is well begun with an examination of unsettled questions. Since a clearer idea of where we are going smoothes the path into the unknown future, a periodic review of such questions is prudent. The present day, lying near the juncture of the centuries, is well suited for such a review. This article reports 16 unsolved problems in educational measurement and points toward what seem to be promising avenues of solution.
Chapter
This chapter is about central ideas concerning how research on the bridge between cognitive psychology and theories of learning and instruction can be conducted in general, and what methodological and logical traps may come with this special endeavor. These conclusions can be made on the basis of decades of consistent and complementary research on mental model theory and on model-based learning and instruction. The chapter begins with a presentation and discussion of design experiments and extended design experiments from the tradition of experimental research and their relations to practical feasibility and to different traditions and paradigms from the philosophy of science. Then, common and new metaphors for the interpretation of effects at the empirical levels of the methodological assumptions are introduced and discussed against the backdrop of applied theory of learning and instruction.
Article
Human services interventions are most rigorously evaluated with true experimental designs in longitudinal experimental field trials (LEFTs). However, differential self-selection or attrition often poses a serious threat to a LEFT's internal validity. This threat can be largely overcome by describing all conditions in advance to prospective subjects and securing their agreement to participate in and complete whichever condition is selected at random by a Lottery. This solution, however, then poses the external validity problem that the program's effects on those who would participate in a Lottery may well be different from its effects on those who would participate in any single condition. In the present paper, we describe a new design, termed the Combined Modified Design, which assesses and overcomes this problem. This new design, in which a modified version of the Randomized Invitation Design (in which only one condition, assigned at random, is described to a potential subject, but outcome measures are obtained on everyone) is combined with the Lottery LEFT, is illustrated with a hypothetical example.
Article
Full-text available
Data from the 1992 National Assessment of Educational Progress are used to compare the performance of New Jersey public school children with those from other participating states. The comparisons are made with the raw mean scores and after standardizing all state scores to a common (national U.S.) demographic mixture. It is argued that for most plausible questions about the performance of public schools the standardized scores are more useful. Also, it is shown that if New Jersey is viewed as an independent nation, its students finished sixth among all the nations participating in the 1991 International Mathematics Assessment.
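The standardization used here is ordinary direct standardization: each state's group means are reweighted to the national demographic mixture. In symbols (notation ours),

```latex
\bar{Y}_{\text{std}} = \sum_{g} w_{g}\,\bar{y}_{g},
```

where \bar{y}_{g} is the state's mean score for demographic group g and w_{g} is group g's share in the common national reference population.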
Article
Full-text available
An important, frequent, and unresolved problem in treatment research is deciding how to analyze outcome data when some of the data are missing. After a brief review of alternative procedures and the underlying models on which they are based, an approach is presented for dealing with the most common situation--comparing the outcome results in a 2-group, randomized design in the presence of missing data. The proposed analysis is based on the concept of "modeling our ignorance" by examining all possible outcomes, given a known number of missing results with a binary outcome, and then describing the distribution of those results. This method allows the researcher to define the range of all possible results that could have resulted had the missing data been observed. Extensions to more complex designs are discussed.
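A minimal sketch of this "modeling our ignorance" idea for a two-group design with a binary outcome, using hypothetical counts: enumerate every way the missing outcomes could have turned out and report the range of estimates that results.

```python
# Hypothetical 2-group randomized trial with binary outcomes and missing data.
n1, s1, m1 = 50, 30, 8    # treatment: randomized, successes observed, missing
n2, s2, m2 = 50, 22, 5    # control

effects = []
for k1 in range(m1 + 1):        # possible successes among missing, treatment
    for k2 in range(m2 + 1):    # possible successes among missing, control
        effects.append((s1 + k1) / n1 - (s2 + k2) / n2)

print(f"complete-case estimate: {s1/(n1-m1) - s2/(n2-m2):+.3f}")
print(f"range over all possible outcomes: [{min(effects):+.3f}, {max(effects):+.3f}]")
```

If the whole range stays on one side of zero, the conclusion survives any configuration of the missing data; if it straddles zero, the missing data genuinely matter.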
Article
To investigate the validity of five prevalent negative beliefs about residential placement, we followed adolescents from a residential program and a comparison group at 3-month intervals for 4 to 8 years. This residential program in the Midwest uses the Teaching-Family Model in which six to eight adolescents live in a family-style environment. The interviews included five scales reflecting youths' views about important aspects of their lives in placement: (1) Delivery of Helpful Treatment, (2) Satisfaction with Supervising Adults, (3) Isolation from Family, (4) Isolation from Friends, and (5) Sense of Personal Control. Hierarchical linear modeling allowed us to estimate group differences while controlling for developmental trends, demographic factors, and prior differences between groups. The two groups were equivalent on all scales before the study. During the following placement, however, the treatment group's ratings were significantly more positive than the comparison group on four of the five scales and approached significance on the fifth. These findings suggest that negative beliefs about life in residential placement for adolescents may not apply to all programs.
Article
In summary, nonselection and nonresponse bias can have a potent impact on the validity of clinical veterinary research studies and should be carefully assessed by investigators and readers. The risk of nonselection and nonresponse bias has been compared to "lowering yourself into a dark pit and trusting you won't be bitten by a snake ... before you go into the pit, you should stand outside and listen for a hissing sound ... if you hear one, do not go on ... if you do not hear anything, you may proceed with caution, being confident that at worst you will be bitten by a quiet snake."
Article
Full-text available
Surveys have been, and will most likely continue to be, the source of data for many empirical articles. Likewise, the difficulty of making valid statistical inferences in the face of missing data will continue to plague researchers. In an ideal situation, all potential survey participants would respond; in reality, the goal of an 80 to 90% response rate is very difficult to achieve. When nonresponse is systematic, the combination of low response rate and systematic differences can severely bias inferences that are made by the researcher to the population. It is important for the researcher to assess the potential causes of nonresponse and the differences between the observed values in the sample compared to what may have been gained if the sample was complete, particularly when the response rate is low. There are methods available that substitute imputed values for missing data, but these methods are useless if the researcher lacks knowledge of how the responders and nonresponders may differ. With regard to statistical inference, the researcher also should be aware of the difference between a convenient sample and a probability sample. Valid statistical inference assumes that the probability of characteristics observed in the sample bear some relationship to their occurrence in the population. For example, in a simple random sample each member of the accessible population has an equal chance of inclusion in the sample. A convenient sample lacks the statistical properties of a probability sample that allow the validity of its inferences to be assessed strictly from a mathematical framework. The context of the research and the type of data being gathered greatly affect the validity of any generalizations the researcher makes with regard to the population the convenient sample attempts to represent.
Article
This article (which is mainly expository) sets up graphical models for causation, having a bit less than the usual complement of hypothetical counterfactuals. Assuming the invariance of error distributions may be essential for causal inference, but the errors themselves need not be invariant. Graphs can be interpreted using conditional distributions, so that we can better address connections between the mathematical framework and causality in the world. The identification problem is posed in terms of conditionals. As will be seen, causal relationships cannot be inferred from a data set by running regressions unless there is substantial prior knowledge about the mechanisms that generated the data. There are few successful applications of graphical models, mainly because few causal pathways can be excluded on a priori grounds. The invariance conditions themselves remain to be assessed.
Article
School performance and attitudes of a group of children placed in residential care were assessed during placement and for an average of four years after discharge. A comparison group of children who were not placed in the program was also followed. The residential program emphasized both behavioral and educational treatment. Group differences were tested using Hierarchical Linear Modeling (HLM). Results indicated that the treatment group had significantly greater improvements in both school performance and attitudes during placement. These differences were also maintained after discharge. It is suggested that long-term educational effects with troubled children may require an intensive intervention over an extended period of time.
Article
Full-text available
While he was a member of the Statistical Research Group (SRG), Abraham Wald worked on the problem of estimating the vulnerability of aircraft, using data obtained from survivors. This work was published as a series of SRG memoranda and was used in World War II and in the wars in Korea and Vietnam. The memoranda were recently reissued by the Center for Naval Analyses. This article is a condensation and exposition of Wald's work, in which his ideas and methods are described. In the final section, his main results are reexamined in the light of classical statistical theory and more recent work.
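The selection logic Wald formalized can be shown with a toy simulation (the hit and survival probabilities below are hypothetical, not Wald's estimates): hits on the most vulnerable sections are underrepresented among returning aircraft, so the naive reading of survivor data points at exactly the wrong places to armor.

```python
import numpy as np

rng = np.random.default_rng(1)
sections = ["engine", "fuselage", "fuel", "rest"]
hit_prob = np.array([0.25, 0.35, 0.15, 0.25])         # where hits actually land
survive_given_hit = np.array([0.4, 0.95, 0.6, 0.9])   # hypothetical vulnerability

n = 200_000
hits = rng.choice(len(sections), size=n, p=hit_prob)
survived = rng.random(n) < survive_given_hit[hits]

# Hit distribution among survivors -- the only data available in the field.
obs = np.bincount(hits[survived], minlength=4) / survived.sum()
for name, p_true, p_obs in zip(sections, hit_prob, obs):
    print(f"{name:9s} true hit share {p_true:.2f}   share among survivors {p_obs:.2f}")
# Engine hits are scarce among survivors precisely because they down the plane.
```

Wald's contribution was to run this logic in reverse: from the survivor distribution and the assumed distribution of hits, recover the survival probabilities for each section.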
Article
Full-text available
The propensity score is the conditional probability of assignment to a particular treatment given a vector of observed covariates. Both large and small sample theory show that adjustment for the scalar propensity score is sufficient to remove bias due to all observed covariates. Applications include: (i) matched sampling on the univariate propensity score, which is a generalization of discriminant matching, (ii) multivariate adjustment by subclassification on the propensity score where the same subclasses are used to estimate treatment effects for all outcome variables and in all subpopulations, and (iii) visual representation of multivariate covariance adjustment by a two-dimensional plot.
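A minimal sketch of application (ii), subclassification on an estimated propensity score, using simulated data and assuming scikit-learn for the logistic fit:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 20_000
x = rng.normal(size=(n, 2))                         # observed covariates
p = 1 / (1 + np.exp(-(x[:, 0] + 0.5 * x[:, 1])))    # true assignment probability
t = rng.random(n) < p                               # treatment assignment
y = 1.0 * t + x[:, 0] + rng.normal(size=n)          # true treatment effect = 1.0

e = LogisticRegression().fit(x, t).predict_proba(x)[:, 1]      # propensity score
strata = np.digitize(e, np.quantile(e, [0.2, 0.4, 0.6, 0.8]))  # score quintiles

diffs, sizes = [], []
for s in range(5):
    m = strata == s
    diffs.append(y[m & t].mean() - y[m & ~t].mean())
    sizes.append(m.sum())

print(f"naive difference:       {y[t].mean() - y[~t].mean():.2f}")        # confounded
print(f"subclassified estimate: {np.average(diffs, weights=sizes):.2f}")  # near 1.0
```

Five subclasses on the score remove most of the overt bias due to the observed covariates, which is the practical content of the theory in the abstract.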
Article
Full-text available
The performance of minority examinees on the SAT is carefully monitored by the national educational media. Changes of 10 or 15 points over a five-year period are interpreted as having a significant and important relationship to the educational process. A crucial assumption underlying the validity of this inference is that the performance of an examinee is unrelated to that examinee’s choosing to identify his or her ethnicity. In this article, it is shown that this assumption is false and that the potential errors introduced by it dwarf the changes being interpreted as real.
Article
Full-text available
Presents a discussion of matching, randomization, random sampling, and other methods of controlling extraneous variation. The objective was to specify the benefits of randomization in estimating causal effects of treatments. It is concluded that randomization should be employed whenever possible but that the use of carefully controlled nonrandomized data to estimate causal effects is a reasonable and necessary procedure in many cases. (15 ref) (PsycINFO Database Record (c) 2006 APA, all rights reserved).
Book
During the course of the rhetoric surrounding the 1984 Presidential election campaign in the United States there were a variety of statements made that gave me pause. For example, I heard candidate Ferraro explain her poor showing in pre-election polls by saying, "I don't believe those polls. If you could see the enthusiasm for our candidacy 'out there' you wouldn't believe them either." Obviously, trying to estimate one's popularity in the entire voting population from the enthusiasm of your supporters at political rallies is not likely to yield accurate results. I suspect that trying to statistically adjust the "rally estimate" through the use of the demographic characteristics of those who attend would not have helped enough to be useful. A modest survey on a more randomly chosen sample would surely have been better. At about the same time, Secretary of Education Terrell Bell released a table entitled State Education Statistics. Among other bits of information, it contained the mean scores on the Scholastic Aptitude Test (the SAT) for 22 of the states. The College Board had previously released these mean scores for all states. At this point the mass media began carrying reports interpreting the differences. The Reagan White House pointed out that spending more money on education was not the way to improve educational outcomes. To support this they pointed to the mean SAT scores of Connecticut and New Hampshire. New Hampshire had modestly higher SAT scores but lower per-pupil expenditure.
Chapter
Consider the following simplified setting of the problem discussed by Dr. Rubin. (Y, R) have a joint distribution over a population of units in which Y is a variable of interest and R = 0 for a unit if it is a nonrespondent and R = 1 otherwise. Dr. Rubin considers the problem of specifying the joint distribution of (Y, R) and distinguishes two methods of doing this: (M) Mixture model: specify Y | R and R; (S) Selection model: specify R | Y and Y.
Chapter
It is sometimes suspected that nonresponse to a sample survey is related to the primary outcome variable. This is the case, for example, in studies of income or of alcohol consumption behaviors. If nonresponse to a survey is related to the level of the outcome variable, then the sample mean of this outcome variable based on the respondents will generally be a biased estimate of the population mean. If this outcome variable has a linear regression on certain predictor variables in the population, then ordinary least squares estimates of the regression coefficients based on the responding units will generally be biased unless nonresponse is a stochastic function of these predictor variables. The purpose of this paper is to discuss the performance of two alternative approaches, the selection model approach and the mixture model approach, for obtaining estimates of means and regression estimates when nonresponse depends on the outcome variable. Both approaches extend readily to the situation when values of the outcome variable are available for a subsample of the nonrespondents, called “follow-ups.” The availability of follow-ups is a feature of the example we use to illustrate comparisons.
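The two approaches correspond to the two factorizations of the joint distribution of the outcome Y and the response indicator R, matching the notation of the chapter above:

```latex
\underbrace{f(y, r) \;=\; f(y \mid r)\,P(r)}_{\text{mixture model}}
\qquad\qquad
\underbrace{f(y, r) \;=\; P(r \mid y)\,f(y)}_{\text{selection model}}
```

Follow-up data on a subsample of nonrespondents inform f(y | R = 0) directly under the mixture factorization, which is what makes the comparison in this paper possible.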
Article
Public attention has been drawn to recent reports of state-by-state variation in standardized test scores, in particular the Scholastic Aptitude Test (SAT). In this paper, Brian Powell and Lala Carr Steelman attempt to show how the dissemination of uncorrected state SAT scores may have created an inaccurate public and governmental perception of the variation in educational quality. Their research demonstrates that comparing state SAT averages is ill-advised unless these ratings are corrected for compositional and demographic factors for which states may not be directly responsible.
Article
In January 1984 and again in January 1985, then Secretary of Education Bell released the table "State Education Statistics." These tables contained a variety of education indicators, among them average SAT or ACT scores for each state. In this paper we examine these scores to see if they can be used for state-by-state comparisons to aid in the evaluation of those educational policies that vary across states. We conclude that statistical adjustment to remove the bias introduced by inappropriate aggregation and self-selection of examinees is not sufficient to insure the validity of the kinds of inferences that are desired.
Article
Adjustments for bias in observational studies are not always confined to variables that were measured prior to treatment. Estimators that adjust for a concomitant variable that has been affected by the treatment are generally biased. The bias may be written as the sum of two easily interpreted components: one component is present only in observational studies; the other is common to both observational studies and randomized experiments. The first component of bias will be zero when the affected posttreatment concomitant variable is, in a certain sense, a surrogate for an unobserved pretreatment variable. The second component of bias can often be addressed by an appropriate sensitivity analysis.
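The first component of this bias is the eelworm problem in miniature; a simulation sketch with hypothetical coefficients (assuming numpy) shows how adjusting for a concomitant affected by the treatment removes the part of the effect transmitted through it:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
x = rng.integers(0, 2, n).astype(float)       # treatment: fumigation yes/no
z = -2.0 * x + rng.normal(size=n)             # eelworm count, reduced by treatment
y = 1.0 * x - 1.5 * z + rng.normal(size=n)    # yield; total effect = 1 + 3 = 4

def ols_slope(design, outcome):
    return np.linalg.lstsq(design, outcome, rcond=None)[0][1]

ones = np.ones(n)
print("unadjusted:", ols_slope(np.column_stack([ones, x]), y))      # ~4.0 (total)
print("Z-adjusted:", ols_slope(np.column_stack([ones, x, z]), y))   # ~1.0 (direct)
```

Partialing out the eelworm count leaves only the direct effect, so equal adjusted yields in treated and untreated fields say nothing against the pesticide.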
Article
Since 1980 the decline in SAT scores has stopped and the scores have started to creep back up. The scores for white Americans have increased 8 points during this period and 15 points for non-whites. It was thus surprising to discover that the overall mean increased only 7 points. This is not an arithmetic error but rather an example of a well-known statistical phenomenon called Simpson's Paradox. In this note we explain the paradox and describe a method that will avoid it in the future.
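A worked example with hypothetical numbers (not the article's actual SAT figures) shows the mechanism: both subgroup means rise, but the mixture shifts toward the lower-scoring group, so the overall mean rises less or can even fall.

```python
# Hypothetical means and group shares for two periods.
white_mean  = {"t1": 950, "t2": 958}    # up 8 points
other_mean  = {"t1": 830, "t2": 845}    # up 15 points
white_share = {"t1": 0.85, "t2": 0.75}  # minority share of test takers grows

for t in ("t1", "t2"):
    overall = white_share[t] * white_mean[t] + (1 - white_share[t]) * other_mean[t]
    print(t, round(overall, 1))
# t1 932.0, t2 929.8 -- the overall mean *falls* although both groups improved.
```

The remedy the note describes amounts to comparing means at a fixed demographic mixture rather than at each year's own mixture.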
Article
Suggests that the advances made by minorities in test performance reported by L. V. Jones (see record 1985-26568-001) must be interpreted with some skepticism in light of the self-selecting nature of the College Board Scholastic Aptitude Test and the 120-point gap in scores on this test between Blacks and Whites. (1 ref) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Social scientists never have access to true experimental data of the type sometimes available to laboratory scientists. Our inability to use laboratory methods to independently vary treatments to eliminate or isolate spurious channels of causation places a fundamental limitation on the possibility of objective knowledge in the social sciences. In place of laboratory experimental variation, social scientists use subjective thought experiments. Assumptions replace data. In the jargon of modern econometrics, minimal identifying assumptions are invoked.
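Heckman's selection model, which the abstract of Wainer (1989) above lists among the model-based adjustments, makes those identifying assumptions explicit. A textbook sketch of the two-step estimator on simulated data (assuming statsmodels and scipy; not Heckman's own application):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(4)
n = 50_000
w = rng.normal(size=n)                           # covariate driving selection
u, v = rng.multivariate_normal([0, 0], [[1, 0.7], [0.7, 1]], n).T

selected = (0.5 + 1.0 * w + v) > 0               # selection equation
y = 2.0 + u                                      # outcome, seen only if selected

# Step 1: probit for selection, then the inverse Mills ratio.
probit = sm.Probit(selected.astype(int), sm.add_constant(w)).fit(disp=0)
xb = probit.fittedvalues                         # linear index w'gamma
imr = norm.pdf(xb) / norm.cdf(xb)

# Step 2: OLS on the selected sample with the IMR as an extra regressor.
ols = sm.OLS(y[selected], sm.add_constant(imr[selected])).fit()
print(f"naive mean of observed y: {y[selected].mean():.2f}")   # biased upward
print(f"selection-corrected intercept: {ols.params[0]:.2f}")   # close to 2.0
```

The correction works only as well as the assumed error structure; that fragility is precisely the "assumptions replace data" point of this passage.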
Article
Treats studies, primarily in human populations, that show causal effects of certain agents, procedures, treatments, or programs. Deals with the difficulties that comparative observational studies have because of bias in their design and analysis. Systematically considers the many sources of bias and discusses how care in matching or adjustment of results can reduce the effects of bias in these investigations.
Article
If treatment assignment is strongly ignorable, then adjustment for observed covariates is sufficient to produce consistent estimates of treatment effects in observational studies. A general approach to testing this critical assumption is developed and applied to a study of the effects of nuclear fallout on the risk of childhood leukemia. R.A. Fisher's advice on the interpretation of observational studies was “Make your theories elaborate”; formally, make causal theories sufficiently detailed that, under the theory, strongly ignorable assignment has testable consequences.
Article
This is an invited expository article for The American Statistician. It reviews the nonparametric estimation of statistical error, mainly the bias and standard error of an estimator, or the error rate of a prediction rule. The presentation is written at a relaxed mathematical level, omitting most proofs, regularity conditions, and technical details.
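The core procedure the review covers fits in a few lines. A minimal sketch of the nonparametric bootstrap estimate of a standard error, on hypothetical data (the helper name is ours):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.exponential(scale=2.0, size=40)       # hypothetical skewed sample

def bootstrap_se(data, stat, B=2000):
    """Nonparametric bootstrap standard error of a statistic."""
    n = len(data)
    reps = [stat(rng.choice(data, size=n, replace=True)) for _ in range(B)]
    return np.std(reps, ddof=1)

print("sample median:", np.median(x))
print("bootstrap SE of the median:", bootstrap_se(x, np.median))
```

Resampling with replacement from the data stands in for resampling from the unknown population; the spread of the replicated statistic estimates its sampling variability with no formula required.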
Article
Recent research (Page & Feifs, 1985; Powell & Steelman, 1984; Steelman & Powell, 1985) attempts to draw inferences about the relative standing of the states on the basis of mean SAT scores. These papers point out that this cannot be done without statistically adjusting for various differences among the states. In this paper I identify five serious errors that, when made, call into question the validity of such inferences. In addition, I describe some plausible ways to avoid the errors.
Article
Problems involving causal inference have dogged at the heels of statistics since its earliest days. Correlation does not imply causation, and yet causal conclusions drawn from a carefully designed experiment are often valid. What can a statistical model say about causation? This question is addressed by using a particular model for causal inference (Holland and Rubin 1983; Rubin 1974) to critique the discussions of other writers on causation and causal inference. These include selected philosophers, medical researchers, statisticians, econometricians, and proponents of causal modeling.
Article
Causal effects are comparisons among values that would have been observed under all possible assignments of treatments to experimental units. In an experiment, one assignment of treatments is chosen and only the values under that assignment can be observed. Bayesian inference for causal effects follows from finding the predictive distribution of the values under the other assignments of treatments. This perspective makes clear the role of mechanisms that sample experimental units, assign treatments and record data. Unless these mechanisms are ignorable (known probabilistic functions of recorded values), the Bayesian must model them in the data analysis and, consequently, confront inferences for causal effects that are sensitive to the specification of the prior distribution of the data. Moreover, not all ignorable mechanisms can yield data from which inferences for causal effects are insensitive to prior specifications. Classical randomized designs stand out as especially appealing assignment mechanisms designed to make inference for causal effects straightforward by limiting the sensitivity of a valid Bayesian analysis.
Article
A common type of observational study compares population rates in several regions having differing policies in an effort to assess the effects of those policies. In many studies, particularly in public health and epidemiology, age-adjusted rates are regressed on predictor variables to give a covariance-adjusted estimate of effect; this estimate is shown to be generally biased for the appropriate regression coefficient. For familiar models, the analysis of crude rates with age as a covariate can lead to unbiased estimates, and therefore can be preferable. Several other regression methods are also considered.