Article

Improving Transparency in Observational Social Science Research: A Pre-Analysis Plan Approach

Abstract

Social science research is increasingly data-driven, rigorous, and policy-relevant, but is at risk of being devalued due to evidence of the prevalence of problematic research practices and norms. This has led to growing interest in transparency practices in the social sciences. At present, the bulk of this work is centered around randomized controlled trials, which constitute a small fraction of social science research. I propose three scenarios in which study pre-registration can be credibly applied to non-experimental research. I outline suggested contents for observational pre-analysis plans, and highlight where these plans should deviate from pre-analysis plans designed for experimental research.

... For example, it might be possible to preregister studies of observational data in a way that makes it possible to verify that the pre-analysis plan truly preceded the data analysis (for a discussion of this issue in medical research, see Dal-Ré et al. 2014). One can imagine a preregistration approach for studies that will be conducted after a particular event has occurred (such as an election or data release) or, more generally, before scholars have been granted access to restricted data (Burlig 2018; Christensen, Freese, and Miguel 2019). Ofosu and Posner (2020b) find that roughly 4 percent of the pre-analysis plans that they reviewed were for observational data: in fact, among some studies discussed earlier, both Blanco-Perez and ...
Article
A decade ago, the term “research transparency” was not on economists' radar screen, but in a few short years a scholarly movement has emerged to bring new open science practices, tools and norms into the mainstream of our discipline. The goal of this article is to lay out the evidence on the adoption of these approaches – in three specific areas: open data, pre-registration and pre-analysis plans, and journal policies – and, more tentatively, begin to assess their impacts on the quality and credibility of economics research. The evidence to date indicates that economics (and related quantitative social science fields) are in a period of rapid transition toward new transparency-enhancing norms. While solid data on the benefits of these practices in economics is still limited, in part due to their relatively recent adoption, there is growing reason to believe that critics' worst fears regarding onerous adoption costs have not been realized. Finally, the article presents a set of frontier questions and potential innovations.
... This seems to suggest there is value in finding creative ways to credibly document limited data access while writing a pre-analysis plan, in order to curtail concerns regarding p-hacking in observational studies (Burlig, 2018). Neumark (2001) found a way around this challenge. ...
Article
Full-text available
Economists have recently adopted preanalysis plans in response to concerns about robustness and transparency in research. The increased use of registered preanalysis plans has raised competing concerns that detailed plans are costly to create, overly restrictive, and limit the type of inspiration that stems from exploratory analysis. We consider these competing views of preanalysis plans, and make a careful distinction between the roles of preanalysis plans and registries, which provide a record of all planned research. We propose a flexible “packraft” preanalysis plan approach that offers benefits for a wide variety of experimental and nonexperimental applications in applied economics. JEL CLASSIFICATION A14; B41; C12; C18; C90; O10; Q00
Article
Credible economic research demands discipline and defensible modeling assumptions—both theoretical and empirical—but incentives to strategically shape findings (e.g., p‐hack) can be strong. We examine recent waves of empiricism in economics and the ethical concerns and responses they prompted. Statistical abuses that opportunistically search for significance are often inseparable from conceptual abuses of opportunistic model identification (i.e., p‐hacking writ large). We compare neoclassical with positivist hacking proclivities and explore associated implications for empirical analysis and peer review. Drawing on our experiences, 25 years apart, as AJAE editors we reflect on efforts to evaluate research quality and enhance research transparency.
Article
Full-text available
Empirically analyzing empirical evidence. One of the central goals in any scientific endeavor is to understand causality. Experiments that seek to demonstrate a cause/effect relation most often manipulate the postulated causal factor. Aarts et al. describe the replication of 100 experiments reported in papers published in 2008 in three high-ranking psychology journals. Assessing whether the replication and the original experiment yielded the same result according to several criteria, they find that about one-third to one-half of the original findings were also observed in the replication study. Science, this issue, 10.1126/science.aac4716
Article
Full-text available
Transparency, openness, and reproducibility are readily recognized as vital features of science (1, 2). When asked, most scientists embrace these features as disciplinary norms and values (3). Therefore, one might expect that these valued features would be routine in daily practice. Yet, a growing body of evidence suggests that this is not the case (4–6).
Article
Full-text available
The PLOS Medicine Editors endorse four measures to ensure transparency in the analysis and reporting of observational studies. Please see later in the article for the Editors' Summary.
Article
Full-text available
In randomized trials, pair-matching is an intuitive design strategy to protect study validity and to potentially increase study power. In a common design, candidate units are identified, and their baseline characteristics used to create the best n/2 matched pairs. Within the resulting pairs, the intervention is randomized, and the outcomes measured at the end of follow-up. We consider this design to be adaptive, because the construction of the matched pairs depends on the baseline covariates of all candidate units. As a consequence, the observed data cannot be considered as n/2 independent, identically distributed pairs of units, as common practice assumes. Instead, the observed data consist of n dependent units. This paper explores the consequences of adaptive pair-matching in randomized trials for estimation of the average treatment effect, conditional on the baseline covariates of the n study units. By avoiding estimation of the covariate distribution, estimators of this conditional effect will often be more precise than estimators of the marginal effect. We contrast the unadjusted estimator with targeted minimum loss-based estimation and show substantial efficiency gains from matching and further gains with adjustment. This work is motivated by the Sustainable East Africa Research in Community Health study, an ongoing community randomized trial to evaluate the impact of immediate and streamlined antiretroviral therapy on HIV incidence in rural East Africa. Copyright © 2014 John Wiley & Sons, Ltd.
Article
Full-text available
The vast majority of health-related observational studies are not prospectively registered and the advantages of registration have not been fully appreciated. Nonetheless, international standards require approval of study protocols by an independent ethics committee before the study can begin. We suggest that there is an ethical and scientific imperative to publicly preregister key information from newly approved protocols, which should be required by funders. Ultimately, more complete information may be publicly available by disclosing protocols, analysis plans, data sets, and raw data.
Article
Full-text available
There is growing appreciation for the advantages of experimentation in the social sciences. Policy-relevant claims that in the past were backed by theoretical arguments and inconclusive correlations are now being investigated using more credible methods. Changes have been particularly pronounced in development economics, where hundreds of randomized trials have been carried out over the last decade. When experimentation is difficult or impossible, researchers are using quasi-experimental designs. Governments and advocacy groups display a growing appetite for evidence-based policy-making. In 2005, Mexico established an independent government agency to rigorously evaluate social programs, and in 2012, the U.S. Office of Management and Budget advised federal agencies to present evidence from randomized program evaluations in budget requests (1, 2).
Article
Full-text available
In 2008, a group of uninsured low-income adults in Oregon was selected by lottery to be given the chance to apply for Medicaid. This lottery provides an opportunity to gauge the effects of expanding access to public health insurance on the health care use, financial strain, and health of low-income adults using a randomized controlled design. In the year after random assignment, the treatment group selected by the lottery was about 25 percentage points more likely to have insurance than the control group that was not selected. We find that in this first year, the treatment group had substantively and statistically significantly higher health care utilization (including primary and preventive care as well as hospitalizations), lower out-of-pocket medical expenditures and medical debt (including fewer bills sent to collection), and better self-reported physical and mental health than the control group. JEL Codes: H51, H75, I1.
Article
Full-text available
For any given research area, one cannot tell how many studies have been conducted but never reported. The extreme view of the "file drawer problem" is that journals are filled with the 5% of the studies that show Type I errors, while the file drawers are filled with the 95% of the studies that show nonsignificant results. Quantitative procedures for computing the tolerance for filed and future null results are reported and illustrated, and the implications are discussed.
Article
Full-text available
In this article, we accomplish two things. First, we show that despite empirical psychologists' nominal endorsement of a low rate of false-positive findings (≤ .05), flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates. In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not. We present computer simulations and a pair of actual experiments that demonstrate how unacceptably easy it is to accumulate (and report) statistically significant evidence for a false hypothesis. Second, we suggest a simple, low-cost, and straightforwardly effective disclosure-based solution to this problem. The solution involves six concrete requirements for authors and four guidelines for reviewers, all of which impose a minimal burden on the publication process.
Article
Full-text available
Just over a quarter century ago, Edward Leamer (1983) reflected on the state of empirical work in economics. He urged empirical researchers to “take the con out of econometrics” and memorably observed (p. 37): “Hardly anyone takes data analysis seriously. Or perhaps more accurately, hardly anyone takes anyone else’s data analysis seriously.” Leamer was not alone; Hendry (1980), Sims (1980), and others writing at about the same time were similarly disparaging of empirical practice. Reading these commentaries, we wondered as late-1980s Ph.D. students about the prospects for a satisfying career doing applied work. Perhaps credible empirical work in economics is a pipe dream. Here we address the questions of whether the quality and the credibility of empirical work have increased since Leamer’s pessimistic assessment. Our views are necessarily colored by the areas of applied microeconomics in which we are active, but we look over the fence at other areas as well.
Article
Full-text available
There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.
Article
Full-text available
Existing strategies for econometric analysis related to macroeconomics are subject to a number of serious objections, some recently formulated, some old. These objections are summarized in this paper, and it is argued that taken together they make it unlikely that macroeconomic models are in fact overidentified, as the existing statistical theory usually assumes. The implications of this conclusion are explored, and an example of econometric work in a non-standard style, taking account of the objections to the standard style, is presented.
Article
Full-text available
The primary aim of the paper is to place current methodological discussions in macroeconometric modeling contrasting the ‘theory first’ versus the ‘data first’ perspectives in the context of a broader methodological framework with a view to constructively appraise them. In particular, the paper focuses on Colander’s argument in his paper “Economists, Incentives, Judgement, and the European CVAR Approach to Macroeconometrics” contrasting two different perspectives in Europe and the US that are currently dominating empirical macroeconometric modeling and delves deeper into their methodological/philosophical underpinnings. It is argued that the key to establishing a constructive dialogue between them is provided by a better understanding of the role of data in modern statistical inference, and how that relates to the centuries old issue of the realisticness of economic theories.
Article
Full-text available
Transportation costs and monopoly location in the presence of regional disparities. This article analyses the impact of the level of transportation costs on the location choice of a monopolist. We consider two asymmetric regions whose heterogeneity lies in both regional incomes and population sizes: the first region has wide income spreads allocated among few consumers, whereas the second is highly populated but not as wealthy. Among the results, we show that low transportation costs induce the firm to exploit size effects by locating in the most populated region. Moreover, a small decrease in transport costs may induce a net welfare loss, thus allowing for regional development policies that do not rely on inter-regional transportation infrastructures.
Article
We attempt to replicate 67 papers published in 13 well-regarded economics journals using author-provided replication files that include both data and code. Some journals in our sample require data and code replication files, and other journals do not require such files. Aside from 6 papers that use confidential data, we obtain data and code replication files for 29 of 35 papers (83%) that are required to provide such files as a condition of publication, compared to 11 of 26 papers (42%) that are not required to provide data and code replication files. We successfully replicate the key qualitative result of 22 of 67 papers (33%) without contacting the authors. Excluding the 6 papers that use confidential data and the 2 papers that use software we do not possess, we replicate 29 of 59 papers (49%) with assistance from the authors. Because we are able to replicate less than half of the papers in our sample even with help from the authors, we assert that economics research is usually not replicable. We conclude with recommendations on improving replication of economics research.
Article
Using 50,000 tests published in the AER, JPE, and QJE, we identify a residual in the distribution of tests that cannot be explained solely by journals favoring rejection of the null hypothesis. We observe a two-humped camel shape with missing p-values between 0.25 and 0.10 that can be retrieved just after the 0.05 threshold and represent 10-20 percent of marginally rejected tests. Our interpretation is that researchers inflate the value of just-rejected tests by choosing "significant" specifications. We propose a method to measure this residual and describe how it varies by article and author characteristics.
Article
Not just turnout, but turnaround matters. In the last several U.S. presidential elections, the campaign mantra has focused on making sure that voters already aligned with one's candidate do get out to vote. There is a long history of unsuccessful efforts to change people's attitudes. Nevertheless, Broockman and Kalla conducted a field experiment showing that Miami voters shifted their attitudes toward transgender individuals and maintained those changed positions for 3 months (see the Perspective by Paluck). Science, this issue p. 220; see also p. 147
Article
Another social science looks at itself. Experimental economists have joined the reproducibility discussion by replicating selected published experiments from two top-tier journals in economics. Camerer et al. found that two-thirds of the 18 studies examined yielded replicable estimates of effect size and direction. This proportion is somewhat lower than unaffiliated experts were willing to bet in an associated prediction market, but roughly in line with expectations from sample sizes and P values. Science, this issue p. 1433
Article
Imagine a nefarious researcher in economics who is only interested in finding a statistically significant result of an experiment. The researcher has 100 different variables he could examine, and the truth is that the experiment has no impact. By construction, the researcher should find an average of five of these variables statistically significantly different between the treatment group and the control group at the 5 percent level—after all, the exact definition of 5 percent significance implies that there will be a 5 percent false rejection rate of the null hypothesis that there is no difference between the groups. The nefarious researcher, who is interested only in showing that this experiment has an effect, chooses to report only the results on the five variables that pass the statistically significant threshold. If the researcher is interested in a particular sign of the result—that is, showing that this program “works” or “doesn’t work”— on average half of these results will go in the direction the researcher wants. Thus, if a researcher can discard or not report all the variables that do not agree with his desired outcome, the researcher is virtually guaranteed a few positive and statistically significant results, even if in fact the experiment has no effect.
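The 5 percent arithmetic in this abstract can be checked with a short simulation (a minimal sketch for illustration only, not part of the cited paper; the group sizes, number of simulated experiments, and use of t-tests are arbitrary assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_per_group = 50   # observations per arm (assumed for illustration)
n_variables = 100  # outcomes the researcher could examine
n_sims = 1000      # simulated "experiments"

false_positives = []
for _ in range(n_sims):
    # The experiment has no true effect: both groups are drawn
    # from the same distribution for every variable.
    treat = rng.normal(size=(n_variables, n_per_group))
    control = rng.normal(size=(n_variables, n_per_group))
    _, p = stats.ttest_ind(treat, control, axis=1)
    # Count variables that cross the 5 percent threshold by chance.
    false_positives.append(int((p < 0.05).sum()))

# Averaged over many experiments, about 5 of the 100 null variables
# are "significant" at the 5 percent level in each experiment.
print(np.mean(false_positives))
```

Reporting only those few significant variables, as the passage describes, yields seemingly strong results from an experiment with no effect at all.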
Article
The social sciences—including economics—have long called for transparency in research to counter threats to producing robust and replicable results. In this paper, we discuss the pros and cons of three of the more prominent proposed approaches: pre-analysis plans, hypothesis registries, and replications. They have been primarily discussed for experimental research, both in the field including randomized control trials and the laboratory, so we focus on these areas. A pre-analysis plan is a credibly fixed plan of how a researcher will collect and analyze data, which is submitted before a project begins. Though pre-analysis plans have been lauded in the popular press and across the social sciences, we will argue that enthusiasm for pre-analysis plans should be tempered for several reasons. Hypothesis registries are a database of all projects attempted; the goal of this promising mechanism is to alleviate the "file drawer problem," which is that statistically significant results are more likely to be published, while other results are consigned to the researcher's "file drawer." Finally, we evaluate the efficacy of replications. We argue that even with modest amounts of researcher bias—either replication attempts bent on proving or disproving the published work, or poor replication attempts—replications correct even the most inaccurate beliefs within three to five replications. We offer practical proposals for how to increase the incentives for researchers to carry out replications.
Article
This paper investigates consumer inertia in health insurance markets, where adverse selection is a potential concern. We leverage a major change to insurance provision that occurred at a large firm to identify substantial inertia, and develop and estimate a choice model that also quantifies risk preferences and ex ante health risk. We use these estimates to study the impact of policies that nudge consumers toward better decisions by reducing inertia. When aggregated, these improved individual-level choices substantially exacerbate adverse selection in our setting, leading to an overall reduction in welfare that doubles the existing welfare loss from adverse selection.
Article
Regulatory oversight of toxic emissions from industrial plants and understanding about these emissions' impacts are in their infancy. Applying a research design based on the openings and closings of 1,600 industrial plants to rich data on housing markets and infant health, we find that: toxic air emissions affect air quality only within 1 mile of the plant; plant openings lead to 11 percent declines in housing values within 0.5 mile or a loss of about $4.25 million for these households; and a plant's operation is associated with a roughly 3 percent increase in the probability of low birthweight within 1 mile.
Article
We studied publication bias in the social sciences by analyzing a known population of conducted studies—221 in total—in which there is a full accounting of what is published and unpublished. We leveraged Time-sharing Experiments in the Social Sciences (TESS), a National Science Foundation–sponsored program in which researchers propose survey-based experiments to be run on representative samples of American adults. Because TESS proposals undergo rigorous peer review, the studies in the sample all exceed a substantial quality threshold. Strong results are 40 percentage points more likely to be published than are null results and 60 percentage points more likely to be written up. We provide direct evidence of publication bias and identify the stage of research production at which publication bias occurs: Authors do not write up and submit null findings.
Article
I argue that requiring authors to post the raw data supporting their published results has the benefit, among many others, of making fraud much less likely to go undetected. I illustrate this point by describing two cases of suspected fraud I identified exclusively through statistical analysis of reported means and standard deviations. Analyses of the raw data behind these published results provided invaluable confirmation of the initial suspicions, ruling out benign explanations (e.g., reporting errors, unusual distributions), identifying additional signs of fabrication, and also ruling out one of the suspected fraud's explanations for his anomalous results. If journals, granting agencies, universities, or other entities overseeing research promoted or required data posting, it seems inevitable that fraud would be reduced.
Article
We review the statistical models applied to test for heterogeneous treatment effects in the recent empirical literature, with a particular focus on data from randomized field experiments. We show that testing for heterogeneous treatment effects is highly common, and likely to result in a large number of false discoveries when conventional standard errors are applied. We demonstrate that applying correction procedures developed in the statistics literature can fully address this issue, and discuss the implications of multiple testing adjustments for power calculations and experimental design.
Article
Nonlinear pricing and taxation complicate economic decisions by creating multiple marginal prices for the same good. This paper provides a framework to uncover consumers' perceived price of nonlinear price schedules. I exploit price variation at spatial discontinuities in electricity service areas, where households in the same city experience substantially different nonlinear pricing. Using household-level panel data from administrative records, I find strong evidence that consumers respond to average price rather than marginal or expected marginal price. This suboptimizing behavior makes nonlinear pricing unsuccessful in achieving its policy goal of energy conservation and critically changes the welfare implications of nonlinear pricing.
Article
Neumark (2001) used the novel methodology of a prespecified research design to estimate the employment effect of minimum wage changes. We conducted our analysis in the “spirit” of this methodology based on Canadian data from 1981 to 1997. Our minimum wage elasticities are substantial, typically in the range of −0.14 to −0.44, with −0.30 being a reasonable point estimate, and with the effects being larger after lagged adjustments.
Article
This paper presents evidence on the employment effects of recent minimum wage increases from a pre-specified research design that entailed committing to a detailed set of statistical analyses prior to 'going to' the data. Despite the limited data to which the pre-specified research design can be applied, evidence of disemployment effects of minimum wages is often found where we would most expect it--for younger, less-skilled workers.
Article
We investigate how and why the productivity of a worker varies as a function of the productivity of her co-workers in a group production process. In theory, the introduction of a high productivity worker could lower the effort of incumbent workers because of free riding; or it could increase the effort of incumbent workers because of peer effects induced by social norms, social pressure, or learning. Using scanner-level data, we measure high frequency, worker-level productivity of checkers for a large grocery chain. Because of the firm's scheduling policy, the timing of within-day changes in personnel is unsystematic, a feature for which we find consistent support in the data. We find strong evidence of positive productivity spillovers from the introduction of highly productive personnel into a shift. A 10% increase in average co-worker permanent productivity is associated with a 1.7% increase in a worker's effort. Most of this peer effect arises from low productivity workers benefiting from the presence of high productivity workers. Therefore, the optimal mix of workers in a given shift is the one that maximizes skill diversity. In order to explain the mechanism that generates the peer effect, we examine whether effort depends on workers' ability to monitor one another due to their spatial arrangement, and whether effort is affected by the time workers have previously spent working together. We find that a given worker's effort is positively related to the presence and speed of workers who face him, but not the presence and speed of workers whom he faces (and who do not face him). In addition, workers respond more to the presence of co-workers with whom they frequently overlap. These patterns indicate that these individuals are motivated by social pressure and mutual monitoring, and suggest that social preferences can play an important role in inducing effort, even when economic incentives are limited.
Article
Standard sufficient conditions for identification in the regression discontinuity design are continuity of the conditional expectation of counterfactual outcomes in the running variable. These continuity assumptions may not be plausible if agents are able to manipulate the running variable. This paper develops a test of manipulation related to continuity of the running variable density function. The methodology is applied to popular elections to the House of Representatives, where sorting is neither expected nor found, and to roll-call voting in the House, where sorting is both expected and found.
Multiple Inference and Gender Differences in the Effects of Early Intervention: A Reevaluation of the Abecedarian, Perry Preschool, and Early Training Projects
  • Michael L Anderson
The Role of Theory in Field Experiments
  • David Card
The Registration of Observational Studies -When Metaphors Go Bad
  • Epidemiology
Split-Sample Strategies for Avoiding False Discoveries
  • Michael L Anderson
Registration of observational studies
  • BMJ
Reshaping Institutions: Evidence on Aid Impacts Using a Preanalysis Plan
  • Katherine Casey
Using Split Samples to Improve Inference on Causal Effects
  • Marcel Fafchamps
The Oregon Health Insurance Experiment: Evidence from the First Year
  • Amy Finkelstein
Fishing, Commitment, and Communication: A Proposal for Comprehensive Nonbinding Research Registration
  • Macartan Humphreys
Multiple Hypothesis Testing in Experimental Economics
  • John A List
Every Breath You Take -Every Dollar You'll Make: The Long-Term Consequences of the Clean Air Act of 1970
  • Adam Isen
  • Karthik Muralidharan