Article

How can experiments play a greater role in public policy? Twelve proposals from an economic model of scaling


Abstract

Policymakers are increasingly turning to insights gained from the experimental method as a means to inform large-scale public policies. Critics view this increased usage as premature, pointing to the fact that many experimentally tested programs fail to deliver on their promise at scale. Under this view, the experimental approach drives too much public policy. Yet, if policymakers could be more confident that the original research findings would be delivered at scale, even the staunchest critics would carve out a larger role for experiments to inform policy. Leveraging the economic framework of Al-Ubaydli et al. (2019), we put forward 12 simple proposals, spanning researchers, policymakers, funders, and stakeholders, which together tackle the most vexing scalability threats. The framework highlights that only after we deepen our understanding of the scale-up problem will we be on solid ground to argue that scientific experiments should hold a more prominent place in the policymaker's quiver.


... Some recent reviews include DellaVigna and Linos (2020), who review the effectiveness of nudge randomised controlled trials (RCTs) across two so-called 'nudge units'; Beshears and Kosowsky (2020), who review 174 nudge studies to evaluate the average effect size of different nudge strategies; and Jachimowicz et al. (2019), who review 58 studies specifically investigating the default-option nudge to determine the effect size associated with this specific nudge. There have also been recent calls to consider experimental practices in choice-architectural design (John 2021), to consider strategies for scaling nudge interventions (Al-Ubaydli et al. 2021), and for widespread adoption of A/B testing methods (Benartzi 2017). ...
Article
Full-text available
Choice architecture describes the environment in which choices are presented to decision-makers. In recent years, public and private actors have looked at choice architecture with great interest as they seek to influence human behaviour. These actors are typically called choice architects. Increasingly, however, this role of architecting choice is not performed by a human choice architect, but by an algorithm or artificial intelligence, powered by a stream of Big Data and infused with an objective it has been programmed to maximise. We call this entity the autonomous choice architect. In this paper, we present an account of why artificial intelligence can fulfil the role of a choice architect and why this creates problems of transparency, responsibility and accountability for nudges. We argue that choice architects, be they autonomous computational systems or human beings, at the most basic level select, from a range of designs, the design most likely to maximise a predetermined objective. We then argue that, given the growing demand for targeted, personalised choice architecture and for faster, dynamic reconfigurations of choice architecture, as well as the ever-expanding pool of data from which feedback can be drawn, the role of the human choice architect is increasingly obscured behind algorithmic, artificially intelligent systems. We provide a discussion of the implications of autonomous choice architects, focusing on the importance of the humans who programme these systems, ultimately arguing that, despite technological advances, the responsibility for choice architecture and influence remains firmly one that human beings must bear.
... (Abate, Christidis, & Purwanto, 2020; Acs, Åstebro, Audretsch, & Robinson, 2016; AICPA, 2017; Al-Ubaydli, Lee, List, Mackevicius, & Suskind, 2021; Cai, 2017; Engström et al., 2020; Fronzaglia, de Moura Júnior, Racy, & Vartanian, 2019; Goldsztejn, Schwartzman, & Nehorai, 2020; Grove, Sanders, Salway, Goyder, & Hampshaw, 2020; Kamradt-Scott & McInnes, 2012; Maier, 2012; Montenegro Martínez, Carmona Montoya, & Franco Giraldo, 2020; Oueslati, 2015; Ribašauskiene et al., 2019; Russo Rafael et al., 2020; Tuter, 2020). The corporate income tax is the only tax that has an immediate and significant role in the markets of any economy. ...
Article
Full-text available
This paper concerns the chain of the cycle of money. It analyses the utility of the cycle of money with and without enforcement savings and/or escaped savings, providing a complete theoretical scrutiny of that utility. The paper presents both the minimisation and the maximisation case of the cycle of money, and thereby determines the chain of money. That is, we examine the tax-policy and public-policy conditions that best increase consumption and investment under three cases: when both enforcement savings and escaped savings are present, when enforcement savings are absent, and when escaped savings are omitted. The analysis therefore rests on the utility of the public sector and the utility of uncontrolled enterprises, allowing conclusions about the utility of the cycle of money and the behaviour of an economy with and without enforcement savings and/or escaped savings. For the purposes of this analysis, a simple system of first-order conditions is used, together with the Karush-Kuhn-Tucker method.
Article
Full-text available
Worldwide, scholars and public institutions are embracing behavioural insights to improve public policy. Multiple frameworks exist to describe the integration of behavioural insights into policy, and behavioural insights teams (BITs) have specialised in this. Yet, it often remains unclear how these frameworks can be applied and by whom. Here, we describe and discuss a comprehensive framework that describes who does what and when to integrate behavioural insights into policy. The framework is informed by relevant literature, theorising, and experience with one BIT, the Behavioural Insights Group Rotterdam. We discuss how the framework helps to overcome some challenges associated with integrating behavioural insights into policy (an overreliance on randomised control trials, a limited understanding of context, threats to good scientific practice, and bounded rationality of individuals applying behavioural insights).
Article
The goal of creating evidence-based programs is to scale them at sufficient breadth to support population-level improvements in critical outcomes. However, this promise is challenging to fulfill. One of the biggest issues for the field is the reduction in effect sizes seen when a program is taken to scale. This paper discusses an economic perspective that identifies the underlying incentives in the research process that lead to scale up problems and to deliver potential solutions to strengthen outcomes at scale. The principles of open science are well aligned with this goal. One prevention program that has begun to scale across the USA is early childhood home visiting. While there is substantial impact research on home visiting, overall average effect size is .10 and a recent national randomized trial found attenuated effect sizes in programs implemented under real-world conditions. The paper concludes with a case study of the relevance of the economic model and open science in developing and scaling evidence-based home visiting. The case study considers how the traditional approach for testing interventions has influenced home visiting’s evolution to date and how open science practices could have supported efforts to maintain impacts while scaling home visiting. It concludes by considering how open science can accelerate the refinement and scaling of home visiting interventions going forward, through accelerated translation of research into policy and practice.
Article
Full-text available
This paper focuses on policies that are enlightened by behavioural insights (BIs), taking decision-makers’ biases and use of heuristics into account and utilising a people-centric perspective and full acknowledgement of context dependency. Considering both the environmental and pandemic crises, it sketches the goal of resilient food systems and describes the contours of behavioural food policy. Conceptually built on BIs derived from behavioural economics, consumer research and decision science, such an approach systematically uses behavioural policies where appropriate and most cost-effective. BI informed tools (nudges) can be employed as stand-alone instruments (such as defaults) or used to improve the effectiveness of traditional policy tools.
Article
Nowadays, academic journals of high standing rarely accept a conceptual idea in a paper not instantly accompanied by econometric estimates. The idea would almost certainly get rejected. Empirical validation based on past statistical data has produced an unfortunate backward orientation in economics. While one can learn from the past, this approach fails when the underlying conditions strongly change. The paper suggests various possibilities to overcome the intense publication pressure in so‐called top journals and the overemphasis on instant empirical evidence. Academia is, however, unlikely to adapt. As economics is too backward oriented, other disciplines or cranks may well dominate future economic policy.
Article
The promise of randomized controlled trials is that evidence gathered through the evaluation of a specific program helps us—possibly after several rounds of fine-tuning and multiple replications in different contexts—to inform policy. However, critics have pointed out that a potential constraint in this agenda is that results from small “proof-of-concept” studies run by nongovernment organizations may not apply to policies that can be implemented by governments on a large scale. After discussing the potential issues, this paper describes the journey from the original concept to the design and evaluation of scalable policy. We do so by evaluating a series of strategies that aim to integrate the nongovernment organization Pratham’s “Teaching at the Right Level” methodology into elementary schools in India. The methodology consists of reorganizing instruction based on children’s actual learning levels, rather than on a prescribed syllabus, and has previously been shown to be very effective when properly implemented. We present evidence from randomized controlled trials involving some designs that failed to produce impacts within the regular schooling system but still helped shape subsequent versions of the program. As a result of this process, two versions of the programs were developed that successfully raised children’s learning levels using scalable models in government schools. We use this example to draw general lessons about using randomized control trials to design scalable policies.
Article
Full-text available
We analyse how agricultural extension can be made more effective in terms of increasing farmers' adoption of pro-nutrition technologies, such as biofortified crops. In a randomised controlled trial with farmers in Kenya, we implemented several extension treatments and evaluated their effects on the adoption of beans biofortified with iron and zinc. Difference-in-differences estimates show that intensive agricultural training can increase technology adoption considerably. Additional nutrition training helps farmers to better appreciate the technology's nutritional benefits and thus further increases adoption. This study is among the first to analyse how improved extension designs can help to make smallholder farming more nutrition-sensitive.
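Since the abstract leans on difference-in-differences estimates, a toy calculation may help readers unfamiliar with the estimator. The numbers below are invented purely for illustration, not taken from the study.

```python
# Minimal difference-in-differences sketch (all numbers hypothetical).
# The DiD estimate is the change in the trained group minus the
# change in the untrained group, netting out common time trends.

adoption = {
    # (group, period): share of farmers growing biofortified beans
    ("trained", "before"): 0.10,
    ("trained", "after"): 0.45,
    ("control", "before"): 0.12,
    ("control", "after"): 0.20,
}

change_trained = adoption[("trained", "after")] - adoption[("trained", "before")]
change_control = adoption[("control", "after")] - adoption[("control", "before")]
did_estimate = change_trained - change_control

print(f"DiD estimate of training effect: {did_estimate:+.2f}")  # +0.27
```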
Article
Full-text available
Author summary Preclinical animal research is mostly based on studies conducted in a single laboratory and under highly standardized conditions. This entails the risk that the study results may only be valid under the specific conditions of the test laboratory, which may explain the poor reproducibility of preclinical animal research. To test this hypothesis, we used simulations based on 440 preclinical studies across 13 different interventions in animal models of stroke, myocardial infarction, and breast cancer and compared the reproducibility of results between single-laboratory and multi-laboratory studies. To simulate multi-laboratory studies, we combined data from multiple studies, as if several collaborating laboratories had conducted them in parallel. We found that single-laboratory studies produced large variation between study results. By contrast, multi-laboratory studies including as few as 2 to 4 laboratories produced much more consistent results, thereby increasing reproducibility without a need for larger sample sizes. Our findings demonstrate that excessive standardization is a source of poor reproducibility because it ignores biologically meaningful variation. We conclude that multi-laboratory studies—and potentially other ways of creating more heterogeneous study samples—provide an effective means of improving the reproducibility of study results, which is crucial to prevent wasting animals and resources for inconclusive research.
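The simulation logic the authors describe can be sketched in a few lines. The following toy model is our own illustration with made-up parameters (true effect, lab-to-lab SD, sampling error), not the paper's 440-study dataset.

```python
import numpy as np

# Stylized version of the multi-laboratory idea described above: each
# lab adds its own offset to the true effect, so single-lab studies
# scatter widely, while averaging across a few labs cancels much of
# that offset. All parameter values are illustrative assumptions.

rng = np.random.default_rng(0)
true_effect, lab_sd, sampling_se = 0.5, 0.3, 0.1
n_sim = 10_000

def study(n_labs):
    lab_offsets = rng.normal(0, lab_sd, size=(n_sim, n_labs))
    noise = rng.normal(0, sampling_se, size=n_sim)  # same total sample size
    return true_effect + lab_offsets.mean(axis=1) + noise

for k in (1, 2, 4):
    est = study(k)
    print(f"{k} lab(s): SD across studies = {est.std():.3f}")
# The lab-to-lab component shrinks as labs are added, so results
# become more reproducible without any increase in sample size.
```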
Article
Full-text available
Randomized Controlled Trials (RCTs) are increasingly popular in the social sciences, not only in medicine. We argue that the lay public, and sometimes researchers, put too much trust in RCTs over other methods of investigation. Contrary to frequent claims in the applied literature, randomization does not equalize everything other than the treatment in the treatment and control groups, it does not automatically deliver a precise estimate of the average treatment effect (ATE), and it does not relieve us of the need to think about (observed or unobserved) covariates. Finding out whether an estimate was generated by chance is more difficult than commonly believed. At best, an RCT yields an unbiased estimate, but this property is of limited practical value. Even then, estimates apply only to the sample selected for the trial, often no more than a convenience sample, and justification is required to extend the results to other groups, including any population to which the trial sample belongs, or to any individual, including an individual in the trial. Demanding 'external validity' is unhelpful because it expects too much of an RCT while undervaluing its potential contribution. RCTs do indeed require minimal assumptions and can operate with little prior knowledge. This is an advantage when persuading distrustful audiences, but it is a disadvantage for cumulative scientific progress, where prior knowledge should be built upon, not discarded. RCTs can play a role in building scientific knowledge and useful predictions but they can only do so as part of a cumulative program, combining with other methods, including conceptual and theoretical development, to discover not 'what works', but 'why things work'.
Article
Full-text available
Across the social sciences, growing concerns about research transparency have led to calls for pre-analysis plans (PAPs) that specify in advance how researchers intend to analyze the data they are about to gather. PAPs promote transparency and credibility by helping readers distinguish between exploratory and confirmatory analyses. However, PAPs are time-consuming to write and may fail to anticipate contingencies that arise in the course of data collection. This article proposes the use of “standard operating procedures” (SOPs)—default practices to guide decisions when issues arise that were not anticipated in the PAP. We offer an example of an SOP that can be adapted by other researchers seeking a safety net to support their PAPs.
Article
Full-text available
Misinterpretation and abuse of statistical tests, confidence intervals, and statistical power have been decried for decades, yet remain rampant. A key problem is that there are no interpretations of these concepts that are at once simple, intuitive, correct, and foolproof. Instead, correct use and interpretation of these statistics requires an attention to detail which seems to tax the patience of working scientists. This high cognitive demand has led to an epidemic of shortcut definitions and interpretations that are simply wrong, sometimes disastrously so, and yet these misinterpretations dominate much of the scientific literature. In light of this problem, we provide definitions and a discussion of basic statistics that are more general and critical than typically found in traditional introductory expositions. Our goal is to provide a resource for instructors, researchers, and consumers of statistics whose knowledge of statistical theory and technique may be limited but who wish to avoid and spot misinterpretations. We emphasize how violation of often unstated analysis protocols (such as selecting analyses for presentation based on the P values they produce) can lead to small P values even if the declared test hypothesis is correct, and can lead to large P values even if that hypothesis is incorrect. We then provide an explanatory list of 25 misinterpretations of P values, confidence intervals, and power. We conclude with guidelines for improving statistical interpretation and reporting.
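One of the abstract's central warnings, that selecting analyses by their p-values produces small p-values even under a true null, is easy to demonstrate. Below is a hypothetical simulation of our own construction: five outcomes, no true effect anywhere, and only the "best" p-value reported.

```python
import numpy as np
from scipy import stats

# Testing several outcomes under a true null and reporting only the
# smallest p-value inflates the false-positive rate well above 5%.
# Setup (5 outcomes, n = 50 per arm) is invented for illustration.

rng = np.random.default_rng(1)
n_experiments, n_outcomes, n = 5_000, 5, 50

false_pos = 0
for _ in range(n_experiments):
    # treatment has no effect on any outcome
    treat = rng.normal(0, 1, size=(n, n_outcomes))
    ctrl = rng.normal(0, 1, size=(n, n_outcomes))
    pvals = stats.ttest_ind(treat, ctrl, axis=0).pvalue
    false_pos += pvals.min() < 0.05  # report the "best" outcome

print(f"False-positive rate: {false_pos / n_experiments:.2%}")
# ~23% rather than the nominal 5% (approx. 1 - 0.95**5)
```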
Article
Full-text available
Objective: Clinical trials have long been considered the ‘gold standard’ of research generated evidence in health care. Patient recruitment is an important determinant in the success of the trials, yet little focus is placed on the decision making process of patients towards recruitment. Our objective was to identify the key factors pertaining to patient participation in clinical trials, to better understand the low participation rate observed at one clinical research facility in Ireland. Design: Narrative literature review of studies focussing on factors which may act to facilitate or deter patient participation in clinical trials. Studies were identified from Medline, PubMed, Cochrane Library and CINAHL. Results: Sixty-one studies were included in the narrative review: forty-eight of these papers focused specifically on the patient's perspective of participating in clinical trials. The remaining thirteen related to carers, family and health care professional perspectives of participation. The primary factor influencing participation in clinical trials amongst patients was related to personal factors, collectively associated with obtaining a form of personal gain through participation. Cancer was identified as the leading disease entity included in clinical trials, followed by HIV and cardiovascular disease. Conclusion: The vast majority of literature relating to participation in clinical trials emanates from high income countries, with 63% originating from the USA. No studies from low income or developing countries were identified for inclusion in this review, which limits the generalizability of the influencing factors.
Article
Full-text available
Major advances in population health will not occur unless we translate existing knowledge into effective multicomponent interventions, implement and maintain these in communities, and develop rigorous translational research and evaluation methods to ensure continual improvement and sustainability. We discuss challenges and offer approaches to evaluation that are key for translational research stages 3 to 5 to advance optimized adoption, implementation, and maintenance of effective and replicable multicomponent strategies. The major challenges we discuss concern (a) multiple contexts of evaluation/research, (b) complexity of packages of interventions, and (c) phases of evaluation/research questions. We suggest multiple alternative research designs that maintain rigor but accommodate these challenges and highlight the need for measurement systems. Longitudinal data collection and a standardized continuous measurement system are fundamental to the evaluation and refinement of complex multicomponent interventions. To be useful to T3–T5 translational research efforts in neighborhoods and communities, such a system would include assessments of the reach, implementation, effects on immediate outcomes, and effects of the comprehensive intervention package on more distal health outcomes.
Article
Full-text available
This report shows that Knowledge Is Power Program (KIPP) middle schools have significant and substantial positive impacts on student achievement in four core academic subjects: reading, math, science, and social studies. One of the report’s analyses confirms the positive impacts using a rigorous randomized experimental analysis that relies on the schools’ admissions lotteries to identify comparison students, thereby accounting for students’ prior achievement, as well as factors such as student and parent motivation. The latest findings from Mathematica’s multiyear study of KIPP middle schools, the report is the most rigorous large-scale evaluation of KIPP charter schools to date, covering 43 KIPP middle schools in 13 states and the District of Columbia. Student outcomes examined included state test results in reading and math, test scores in science and social studies, results on a nationally normed assessment that includes measures of higher-order thinking, and behaviors reported by students and parents.
Article
Full-text available
Background: The movement of evidence-based practices (EBPs) into routine clinical usage is not spontaneous, but requires focused efforts. The field of implementation science has developed to facilitate the spread of EBPs, including both psychosocial and medical interventions for mental and physical health concerns. Discussion: The authors aim to introduce implementation science principles to non-specialist investigators, administrators, and policymakers seeking to become familiar with this emerging field. This introduction is based on published literature and the authors' experience as researchers in the field, as well as extensive service as implementation science grant reviewers. Implementation science is "the scientific study of methods to promote the systematic uptake of research findings and other EBPs into routine practice, and, hence, to improve the quality and effectiveness of health services." Implementation science is distinct from, but shares characteristics with, both quality improvement and dissemination methods. Implementation studies can either assess naturalistic variability or measure change in response to a planned intervention. Implementation studies typically employ mixed quantitative-qualitative designs, identifying factors that impact uptake across multiple levels, including patient, provider, clinic, facility, organization, and often the broader community and policy environment. Accordingly, implementation science requires a solid grounding in theory and the involvement of trans-disciplinary research teams. The business case for implementation science is clear: As healthcare systems work under increasingly dynamic and resource-constrained conditions, evidence-based strategies are essential in order to ensure that research investments maximize healthcare value and improve public health. Implementation science plays a critical role in supporting these efforts.
Article
Full-text available
A decade ago, the Society of Prevention Research (SPR) endorsed a set of standards for evidence related to research on prevention interventions. These standards (Flay et al., Prevention Science 6:151-175, 2005) were intended in part to increase consistency in reviews of prevention research that often generated disparate lists of effective interventions due to the application of different standards for what was considered to be necessary to demonstrate effectiveness. In 2013, SPR's Board of Directors decided that the field has progressed sufficiently to warrant a review and, if necessary, publication of "the next generation" of standards of evidence. The Board convened a committee to review and update the standards. This article reports on the results of this committee's deliberations, summarizing changes made to the earlier standards and explaining the rationale for each change. The SPR Board of Directors endorses "The Standards of Evidence for Efficacy, Effectiveness, and Scale-up Research in Prevention Science: Next Generation."
Article
Full-text available
Objective: To test the hypothesis that the percentage of screened patients who randomize differs between prevention and therapy trials. Methods: Rapid review of randomized controlled trials (RCTs) identified through published systematic reviews in August 2013. Individually randomized, parallel-group controlled RCTs were eligible if they evaluated metformin monotherapy or exercise for the prevention or treatment of type 2 diabetes. Numbers of patients screened and randomized were extracted by a single reviewer. For each study, the percentage randomized was calculated as a function of those approached, screened, and eligible. Percentages (95% confidence intervals) from each study were weighted according to the denominator, and pooled rates were calculated. Statistical heterogeneity was assessed using I². Results: The percentage of those screened who subsequently randomized was 6.2% (6.0%, 6.4%; 3 studies, I² = 100.0%) for metformin prevention trials; 50.7% (49.9%, 51.4%; 21 studies, I² = 99.6%) for metformin treatment trials; 4.8% (4.7%, 4.8%; 14 studies, I² = 99.9%) for exercise prevention trials; and 43.3% (42.6%, 43.9%; 28 studies, I² = 99.8%) for exercise treatment trials. Conclusion: This study provides qualified support for the hypothesis that prevention trials recruit a smaller proportion of those screened than treatment trials. Statistical heterogeneity associated with pooled estimates and other study limitations is discussed.
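For readers unfamiliar with the pooling described here, a minimal sketch follows. The screened/randomized counts are invented; the denominator weighting and the I² computation mirror the general approach the abstract names, not the paper's exact code.

```python
import numpy as np

# Pool randomization percentages across studies by weighting with the
# screened denominator, then gauge heterogeneity with Cochran's Q and
# the I² statistic. Counts below are hypothetical.

screened = np.array([400, 900, 1500])       # hypothetical denominators
randomized = np.array([20, 50, 60])         # hypothetical numerators

p = randomized / screened
pooled = randomized.sum() / screened.sum()  # denominator-weighted rate

var = p * (1 - p) / screened                # per-study variance of p_i
w = 1 / var
Q = np.sum(w * (p - pooled) ** 2)           # Cochran's Q
I2 = max(0.0, (Q - (len(p) - 1)) / Q)       # I² heterogeneity share

print(f"pooled rate = {pooled:.1%}, Q = {Q:.1f}, I^2 = {I2:.0%}")
```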
Article
Impact evaluations can help to inform policy decisions, but they are rooted in particular contexts and to what extent they generalize is an open question. I exploit a new data set of impact evaluation results and find a large amount of effect heterogeneity. Effect sizes vary systematically with study characteristics, with government-implemented programs having smaller effect sizes than academic or non-governmental organization-implemented programs, even controlling for sample size. I show that treatment effect heterogeneity can be appreciably reduced by taking study characteristics into account.
Article
We embed a field experiment in a nationwide recruitment drive for a new healthcare position in Zambia to test whether career benefits attract talent at the expense of prosocial motivation. In line with common wisdom, offering career opportunities attracts less prosocial applicants. However, the trade-off only exists at low levels of talent; the marginal applicants in treatment are more talented and equally prosocial. These are hired, and perform better at every step of the causal chain: they provide more inputs, increase facility utilization, and improve health outcomes including a 25% decrease in child malnutrition.
Article
What was once broadly viewed as an impossibility—learning from experimental data in economics—has now become commonplace. Governmental bodies, think tanks, and corporations around the world employ teams of experimental researchers to answer their most pressing questions. For their part, in the past two decades academics have begun to more actively partner with organizations to generate data via field experimentation. Although this revolution in evidence‐based approaches has served to deepen the economic science, recently a credibility crisis has caused even the most ardent experimental proponents to pause. This study takes a step back from the burgeoning experimental literature and introduces 12 actions that might help to alleviate this credibility crisis and raise experimental economics to an even higher level. In this way, we view our “12 action wish list” as discussion points to enrich the field.
Article
Media censorship is a hallmark of authoritarian regimes. We conduct a field experiment in China to measure the effects of providing citizens with access to an uncensored internet. We track subjects’ media consumption, beliefs regarding the media, economic beliefs, political attitudes, and behaviors over 18 months. We find four main results: (i) free access alone does not induce subjects to acquire politically sensitive information; (ii) temporary encouragement leads to a persistent increase in acquisition, indicating that demand is not permanently low; (iii) acquisition brings broad, substantial, and persistent changes to knowledge, beliefs, attitudes, and intended behaviors; and (iv) social transmission of information is statistically significant but small in magnitude. We calibrate a simple model to show that the combination of low demand for uncensored information and the moderate social transmission means China’s censorship apparatus may remain robust to a large number of citizens receiving access to an uncensored internet. (JEL C93, D72, D83, L82, L86, L88, P36)
Article
The analysis of data from experiments in economics routinely involves testing multiple null hypotheses simultaneously. These different null hypotheses arise naturally in this setting for at least three different reasons: when there are multiple outcomes of interest and it is desired to determine on which of these outcomes a treatment has an effect; when the effect of a treatment may be heterogeneous in that it varies across subgroups defined by observed characteristics and it is desired to determine for which of these subgroups a treatment has an effect; and finally when there are multiple treatments of interest and it is desired to determine which treatments have an effect relative to either the control or relative to each of the other treatments. In this paper, we provide a bootstrap-based procedure for testing these null hypotheses simultaneously using experimental data in which simple random sampling is used to assign treatment status to units. Using the general results in Romano and Wolf (Ann Stat 38:598–633, 2010), we show under weak assumptions that our procedure (1) asymptotically controls the familywise error rate—the probability of one or more false rejections—and (2) is asymptotically balanced in that the marginal probability of rejecting any true null hypothesis is approximately equal in large samples. Importantly, by incorporating information about dependence ignored in classical multiple testing procedures, such as the Bonferroni and Holm corrections, our procedure has much greater ability to detect truly false null hypotheses. In the presence of multiple treatments, we additionally show how to exploit logical restrictions across null hypotheses to further improve power. We illustrate our methodology by revisiting the study by Karlan and List (Am Econ Rev 97(5):1774–1793, 2007) of why people give to charitable causes.
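A stylized illustration of dependence-aware familywise error control may be useful here. The sketch below is a single-step max-t procedure using re-randomization rather than the paper's bootstrap stepdown refinement; the data, effect size, and sample sizes are all hypothetical.

```python
import numpy as np

# Single-step "max-t" adjustment in the spirit of the procedure
# described above (the actual Romano-Wolf method is a stepdown
# bootstrap refinement). Everything below is an invented example.

rng = np.random.default_rng(2)
n, n_outcomes, B = 200, 4, 2000

treat = rng.binomial(1, 0.5, n)
y = rng.normal(0, 1, size=(n, n_outcomes))
y[:, 0] += 0.5 * treat                      # true effect on outcome 0 only

def tstats(y, d):
    diff = y[d == 1].mean(0) - y[d == 0].mean(0)
    se = np.sqrt(y[d == 1].var(0, ddof=1) / (d == 1).sum()
                 + y[d == 0].var(0, ddof=1) / (d == 0).sum())
    return diff / se

t_obs = tstats(y, treat)

# Approximate the null distribution of max |t| by re-randomizing the
# treatment labels: this breaks any treatment-outcome link while
# preserving the dependence across outcomes that Bonferroni ignores.
max_t = np.array([np.abs(tstats(y, rng.permutation(treat))).max()
                  for _ in range(B)])
p_adj = [(max_t >= abs(t)).mean() for t in t_obs]
print("FWER-adjusted p-values:", np.round(p_adj, 3))
```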
Article
There is growing interest in enhancing research transparency and reproducibility in economics and other scientific fields. We survey existing work on these topics within economics and discuss the evidence suggesting that publication bias, inability to replicate, and specification searching remain widespread in the discipline. We next discuss recent progress in this area, including through improved research design, study registration and pre-analysis plans, disclosure standards, and open sharing of data and materials, drawing on experiences in both economics and other social sciences. We discuss areas where consensus is emerging on new practices, as well as approaches that remain controversial, and speculate about the most effective ways to make economics research more credible in the future.
Article
Some empirical results are more likely to be published than others. Such selective publication leads to biased estimates and distorted inference. This paper proposes two approaches for identifying the conditional probability of publication as a function of a study's results, the first based on systematic replication studies and the second based on meta-studies. For known conditional publication probabilities, we propose median-unbiased estimators and associated confidence sets that correct for selective publication. We apply our methods to recent large-scale replication studies in experimental economics and psychology, and to meta-studies of the effects of minimum wages and de-worming programs.
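The distortion the paper corrects for can be reproduced with a few lines of simulation. The numbers below (true effect, standard error, a publish-only-if-significant rule) are illustrative assumptions, not the paper's estimates.

```python
import numpy as np

# Simulation of the selective-publication problem: if journals publish
# mainly significant results, the published record overstates the true
# effect, most severely when studies are noisy. Numbers are invented.

rng = np.random.default_rng(3)
true_effect, se, n_studies = 0.2, 0.15, 100_000

estimates = rng.normal(true_effect, se, n_studies)
published = estimates[np.abs(estimates / se) > 1.96]  # |z| > 1.96 only

print(f"true effect:             {true_effect}")
print(f"mean published estimate: {published.mean():.3f}")   # ~0.38
print(f"share published:         {len(published) / n_studies:.1%}")
```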
Article
This paper makes the case for greater use of randomized experiments “at scale.” We review various critiques of experimental program evaluation in developing countries, and discuss how experimenting at scale along three specific dimensions—the size of the sampling frame, the number of units treated, and the size of the unit of randomization—can help alleviate the concerns raised. We find that program-evaluation randomized controlled trials published over the last 15 years have typically been “small” in these senses, but also identify a number of examples—including from our own work—demonstrating that experimentation at much larger scales is both feasible and valuable.
Article
Economists often conduct experiments that demonstrate the benefits to individuals of modifying their behavior, such as using a new production process at work or investing in energy saving technologies. A common occurrence is for the success of the intervention in these small-scale studies to diminish substantially when applied at a larger scale, severely undermining the optimism advertised in the original research studies. One key contributor to the lack of general success is that the change that has been demonstrated to be beneficial is not adopted to the extent that would be optimal. This problem is isomorphic to the problem of patient non-adherence to medications that are known to be effective. The large medical literature on countermeasures furnishes economists with potential remedies to this manifestation of the scaling problem.
Article
Randomized trials play an important role in estimating the effect of a policy or social work program in a given population. While most trial designs benefit from strong internal validity, they often lack external validity, or generalizability, to the target population of interest. In other words, one can obtain an unbiased estimate of the study sample average treatment effect from a randomized trial; however, this estimate may not equal the target population average treatment effect if the study sample is not fully representative of the target population. This article provides an overview of existing strategies to assess and improve upon the generalizability of randomized trials, both through statistical methods and study design, as well as recommendations on how to implement these ideas in social work research.
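One of the simplest statistical strategies in this family is reweighting the trial sample so its covariate mix matches the target population. The subgroup effects and shares below are invented solely to show the mechanics.

```python
# Transporting a trial ATE to a target population (all data invented):
# when treatment effects differ across subgroups and the trial
# over-represents one subgroup, the sample ATE and the population ATE
# diverge; reweighting by population shares corrects this.

trial_effect = {"A": 0.40, "B": 0.10}   # subgroup ATEs estimated in the trial
trial_share = {"A": 0.70, "B": 0.30}    # subgroup shares in the trial sample
pop_share = {"A": 0.30, "B": 0.70}      # subgroup shares in the population

sample_ate = sum(trial_effect[g] * trial_share[g] for g in trial_effect)
transported_ate = sum(trial_effect[g] * pop_share[g] for g in trial_effect)

print(f"trial-sample ATE:        {sample_ate:.2f}")        # 0.31
print(f"population-weighted ATE: {transported_ate:.2f}")   # 0.19
```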
Article
Purpose Whether the ASCO Value Framework and the European Society for Medical Oncology (ESMO) Magnitude of Clinical Benefit Scale (MCBS) measure similar constructs of clinical benefit is unclear. It is also unclear how they relate to quality-adjusted life-years (QALYs) and funding recommendations in the United Kingdom and Canada. Methods Randomized clinical trials of oncology drug approvals by the US Food and Drug Administration, European Medicines Agency, and Health Canada between 2006 and August 2015 were identified and scored using the ASCO version 1 (v1) framework, ASCO version 2 (v2) framework, and ESMO-MCBS by at least two independent reviewers. Spearman correlation coefficients were calculated to assess construct (between frameworks) and criterion validity (against QALYs from the National Institute for Health and Care Excellence [NICE] and the pan-Canadian Oncology Drug Review [pCODR]). Associations between scores and NICE/pCODR recommendations were examined. Inter-rater reliability was assessed using intraclass correlation coefficients. Results From 109 included randomized clinical trials, 108 ASCOv1, 111 ASCOv2, and 83 ESMO scores were determined. Correlation coefficients for ASCOv1 versus ESMO, ASCOv2 versus ESMO, and ASCOv1 versus ASCOv2 were 0.36 (95% CI, 0.15 to 0.54), 0.17 (95% CI, −0.06 to 0.37), and 0.50 (95% CI, 0.35 to 0.63), respectively. Compared with NICE QALYs, correlation coefficients were 0.45 (ASCOv1), 0.53 (ASCOv2), and 0.46 (ESMO); with pCODR QALYs, coefficients were 0.19 (ASCOv1), 0.20 (ASCOv2), and 0.36 (ESMO). None of the frameworks were significantly associated with NICE/pCODR recommendations. Inter-rater reliability was good for all frameworks. Conclusion The weak-to-moderate correlations of the ASCO frameworks with the ESMO-MCBS, as well as their correlations with QALYs and with NICE/pCODR funding recommendations, suggest different constructs of clinical benefit measured. Construct convergent validity with the ESMO-MCBS did not increase with the updated ASCO framework.
Article
Policymakers often consider interventions at the scale of the population, or some other large scale. One of the sources of information about the potential effects of such interventions is experimental studies conducted at a significantly smaller scale. A common occurrence is for the treatment effects detected in these small-scale studies to diminish substantially in size when applied at the larger scale that is of interest to policymakers. This paper provides an overview of the main reasons for a breakdown in scalability. Understanding the principal mechanisms represents a first step toward formulating countermeasures that promote scalability.
Article
Multisite trials, in which individuals are randomly assigned to alternative treatment arms within sites, offer an excellent opportunity to estimate the cross-site average effect of treatment assignment (intent to treat or ITT) and the amount by which this impact varies across sites. Although both of these statistics are substantively and methodologically important, only the first has been well studied. To help fill this information gap, we estimate the cross-site standard deviation of ITT effects for a broad range of education and workforce development interventions using data from 16 large multisite randomized control trials. We use these findings to explore hypotheses about factors that predict the magnitude of cross-site impact variation, and we consider the implications of this variation for the statistical precision of multisite trials.
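A method-of-moments sketch of the headline quantity, the cross-site standard deviation of true ITT effects net of sampling error, may clarify what is being estimated. The site estimates and standard errors below are invented, and the formula is the standard DerSimonian-Laird moment estimator, which need not match the authors' exact estimator.

```python
import numpy as np

# Estimate the cross-site SD of true ITT effects by stripping sampling
# error out of the observed spread of site estimates (invented data).

est = np.array([0.05, 0.22, -0.03, 0.15, 0.30, 0.08])  # site ITT estimates
se = np.array([0.06, 0.08, 0.05, 0.07, 0.10, 0.06])    # their standard errors

w = 1 / se**2
mean_itt = np.sum(w * est) / np.sum(w)                  # precision-weighted mean
Q = np.sum(w * (est - mean_itt) ** 2)                   # heterogeneity statistic
df = len(est) - 1
tau2 = max(0.0, (Q - df) / (w.sum() - (w**2).sum() / w.sum()))

print(f"cross-site mean ITT           = {mean_itt:.3f}")
print(f"cross-site SD of true effects = {np.sqrt(tau2):.3f}")
```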
Chapter
A number of critical innovations spurred the rapid expansion in the use of field experiments by academics. Some of these were econometric but many were intensely practical. Researchers learned how to work with a wide range of implementing organizations from small, local nongovernmental organizations to large government bureaucracies. They improved data collection techniques and switched to digital data collection. As researchers got more involved in the design and implementation of the interventions they tested, new ethical issues arose. Finally, the dramatic rise in the use of experiments increased the benefits associated with research transparency. This chapter records some of these practical innovations. It focuses on how to select and effectively work with the organization running an intervention which is being evaluated; ways to minimize attrition, monitor enumerators, and ensure data are collected consistently in treatment and comparison areas; practical ethical issues such as when to start the ethics approval process; and research transparency, including how to prevent publication bias and data mining and the role of experimental registries, preanalysis plans, data publication reanalysis, and replication efforts.
Article
Although randomized experiments are lauded for their high internal validity, they have been criticized for the limited external validity of their results. This chapter describes research strategies for investigating how much nonrepresentative site selection may limit external validity and bias impact findings. The magnitude of external validity bias is potentially much larger than what is thought of as an acceptable level of internal validity bias. The chapter argues that external validity bias should always be investigated by the best available means and addressed directly when presenting evaluation results. These observations flag the importance of making external validity a priority in evaluation planning.
Article
This paper empirically evaluates the cost-effectiveness of Head Start, the largest early-childhood education program in the United States. Using data from the Head Start Impact Study (HSIS), we show that Head Start draws roughly a third of its participants from competing preschool programs, many of which receive public funds. Accounting for the public savings associated with reduced enrollment in other subsidized preschools substantially increases estimates of the program's rate of return. To parse Head Start's test score impacts relative to home care and competing preschools, we selection-correct test scores in each care environment using excluded interactions between experimental offer status and household characteristics. We find that Head Start's effects are greater for children who would not otherwise attend preschool and for children who are less likely to participate in the program.
Article
Given increasing interest in evidence-based policy, there is growing attention to how well the results from rigorous program evaluations may inform policy decisions. However, little attention has been paid to documenting the characteristics of schools or districts that participate in rigorous educational evaluations, and how they compare to potential target populations for the interventions that were evaluated. Utilizing a list of the actual districts that participated in 11 large-scale rigorous educational evaluations, we compare those districts to several different target populations of districts that could potentially be affected by policy decisions regarding the interventions under study. We find that school districts that participated in the 11 rigorous educational evaluations differ from the interventions' target populations in several ways, including size, student performance on state assessments, and location (urban/rural). These findings raise questions about whether, as currently implemented, the results from rigorous impact studies in education are likely to generalize to the larger set of school districts—and thus schools and students—of potential interest to policymakers, and how we can improve our study designs to retain strong internal validity while also enhancing external validity.
Article
The reproducibility of scientific findings has been called into question. To contribute data about reproducibility in economics, we replicate 18 studies published in the American Economic Review and the Quarterly Journal of Economics in 2011-2014. All replications follow predefined analysis plans publicly posted prior to the replications, and have a statistical power of at least 90% to detect the original effect size at the 5% significance level. We find a significant effect in the same direction as the original study for 11 replications (61%); on average the replicated effect size is 66% of the original. The reproducibility rate varies between 67% and 78% for four additional reproducibility indicators, including a prediction market measure of peer beliefs.
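The design rule described (at least 90% power to detect the original effect at the 5% level) reduces to a standard two-sample power calculation. The original effect size below is assumed for illustration only.

```python
from scipy import stats

# Back-of-envelope replication sample size: n per arm giving 90% power
# to detect an assumed original standardized effect d at the 5% level,
# using the normal-approximation formula for a two-sample comparison.

d = 0.35                        # original standardized effect (assumed)
alpha, power = 0.05, 0.90
z_a = stats.norm.ppf(1 - alpha / 2)
z_b = stats.norm.ppf(power)

n_per_arm = 2 * ((z_a + z_b) / d) ** 2
print(f"n per arm ≈ {n_per_arm:.0f}")   # ≈ 172
```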
Article
This paper investigates the effects of California’s billion-dollar class-size-reduction program on student achievement. It uses year-to-year differences in class size generated by variation in enrollment and the state’s class-size-reduction program to identify both the direct effects of smaller classes and related changes in teacher quality. Although the results show that smaller classes raised mathematics and reading achievement, they also show that the increase in the share of teachers with neither prior experience nor full certification dampened the benefits of smaller classes, particularly in schools with high shares of economically disadvantaged, minority students.
Article
Some researchers have argued that anchoring in economic valuations casts doubt on the assumption of consistent and stable preferences. We present new evidence that explores the strength of certain anchoring results. We then present a theoretical framework that provides insights into why we should be cautious of initial empirical findings in general. The model importantly highlights that the rate of false positives depends not only on the observed significance level, but also on statistical power, research priors, and the number of scholars exploring the question. Importantly, a few independent replications dramatically increase the chances that the original finding is true.
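The paper's argument can be made concrete with Bayes' rule: the post-study probability that a finding is true depends on the research prior, statistical power, and the significance level, and rises sharply with independent replications. The prior and power below are illustrative choices, not the paper's calibration.

```python
def post_study_probability(prior, power, alpha, n_significant=1):
    """P(finding is true | n independent significant results), by
    Bayes' rule, following the logic sketched in the abstract above."""
    true_path = prior * power ** n_significant
    false_path = (1 - prior) * alpha ** n_significant
    return true_path / (true_path + false_path)

# Illustrative numbers: a surprising hypothesis (prior 5%) tested
# with 80% power at the 5% significance level.
for k in (1, 2, 3):
    psp = post_study_probability(prior=0.05, power=0.80, alpha=0.05,
                                 n_significant=k)
    print(f"{k} significant finding(s): P(true) = {psp:.0%}")
# 1 -> 46%, 2 -> 93%, 3 -> ~100%: a few independent replications
# dramatically raise the chance the original finding is real.
```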
Article
The present article provides a synthesis of the conceptual and statistical issues involved in using multisite randomized trials to learn about and from a distribution of heterogeneous program impacts across individuals and/or program sites. Learning about such a distribution involves estimating its mean value, detecting and quantifying its variation, and estimating site-specific impacts. Learning from such a distribution involves studying the factors that predict or explain impact variation. Part I of the article introduces the concepts and issues involved. Part II focuses on estimating the mean and variation of impacts of program assignment. Part III extends the discussion to variation in the impacts of program participation. Part IV considers how to use multisite trials to study moderators of program impacts (individual-level or site-level factors that influence these impacts) and mediators of program impacts (individual-level or site-level “mechanisms” that produce these impacts).
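To make "estimating site-specific impacts" concrete, here is a minimal empirical-Bayes shrinkage sketch with invented site estimates and an assumed cross-site variance; it illustrates the general idea rather than the article's full framework.

```python
import numpy as np

# Empirical-Bayes flavor of site-specific impact estimation in a
# multisite trial: noisy site estimates are shrunk toward the
# cross-site mean, more strongly where they are noisier. Data invented.

est = np.array([0.35, -0.05, 0.18, 0.02])  # site impact estimates
se = np.array([0.15, 0.06, 0.10, 0.04])    # their standard errors
tau2 = 0.01                                # assumed cross-site variance

grand_mean = np.average(est, weights=1 / (se**2 + tau2))
shrink = tau2 / (tau2 + se**2)             # reliability weight per site
eb = grand_mean + shrink * (est - grand_mean)

for site, (raw, post) in enumerate(zip(est, eb), 1):
    print(f"site {site}: raw {raw:+.2f} -> shrunken {post:+.2f}")
```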
Article
Randomized controlled trials (RCTs) have gained ground as the dominant tool for studying policy interventions in many fields of applied economics. We theoretically analyze encouragement and resentful demoralization in RCTs and show that these might be rooted in the same behavioral trait: people's propensity to act reciprocally. When people are motivated by reciprocity, the choice of assignment procedure influences the RCT's findings. We show that even credible and explicit randomization procedures do not guarantee an unbiased prediction of the impact of policy interventions; however, they minimize any bias relative to other, less transparent assignment procedures. Keywords: randomized controlled trials, policy experiments, internal validity, procedural concerns, psychological game theory.
Article
The revised Society for Prevention Research (SPR) standards of evidence are an exciting advance in the field of prevention science. We appreciate the committee's vision that the standards represent goals to aspire to rather than a set of benchmarks for where prevention science is currently. The discussion about the standards highlights how much has changed in the field over the last 10 years and as knowledge, theory, and methods continue to advance, the new standards push the field toward increasing rigor and relevance. This commentary discusses how the revised standards support work of translating high-quality evaluations to support evidence-based policy and work supporting evidence-based programs' ability to implement at scale. The commentary ends by raising two areas, generating evidence at scale and transparency of research, as additional areas for consideration in future standards.
Article
Statistical power analysis provides the conventional approach to assess error rates when designing a research study. However, power analysis is flawed in that a narrow emphasis on statistical significance is placed as the primary focus of study design. In noisy, small-sample settings, statistically significant results can often be misleading. To help researchers address this problem in the context of their own studies, we recommend design calculations in which (a) the probability of an estimate being in the wrong direction (Type S [sign] error) and (b) the factor by which the magnitude of an effect might be overestimated (Type M [magnitude] error or exaggeration ratio) are estimated. We illustrate with examples from recent published research and discuss the largest challenge in a design calculation: coming up with reasonable estimates of plausible effect sizes based on external information.
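A design calculation of the kind recommended can be simulated directly. The sketch below, loosely in the spirit of the authors' "retrodesign" approach, takes an assumed plausible true effect and standard error and returns power, the Type S error rate, and the exaggeration ratio; the inputs are illustrative.

```python
import numpy as np
from scipy import stats

# Design calculation: given a plausible true effect and the study's
# standard error, simulate (a) power, (b) the Type S error rate (the
# chance a significant estimate has the wrong sign), and (c) the
# Type M exaggeration ratio. Inputs below are illustrative.

def design_calc(true_effect, se, alpha=0.05, n_sim=100_000, seed=4):
    rng = np.random.default_rng(seed)
    crit = stats.norm.ppf(1 - alpha / 2) * se       # significance cutoff
    est = rng.normal(true_effect, se, n_sim)        # hypothetical estimates
    sig = np.abs(est) > crit
    power = sig.mean()
    type_s = (sig & (np.sign(est) != np.sign(true_effect))).mean() / power
    type_m = np.abs(est[sig]).mean() / abs(true_effect)
    return power, type_s, type_m

power, type_s, type_m = design_calc(true_effect=0.1, se=0.15)
print(f"power {power:.2f}, Type S {type_s:.3f}, exaggeration {type_m:.1f}x")
# Underpowered design: significant estimates overstate the effect ~3x.
```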
Article
Intestinal helminths—including hookworm, roundworm, whipworm, and schistosomiasis—infect more than one-quarter of the world's population. Studies in which medical treatment is randomized at the individual level potentially doubly underestimate the benefits of treatment, missing externality benefits to the comparison group from reduced disease transmission, and therefore also underestimating benefits for the treatment group. We evaluate a Kenyan project in which school-based mass treatment with deworming drugs was randomly phased into schools, rather than to individuals, allowing estimation of overall program effects. The program reduced school absenteeism in treatment schools by one-quarter, and was far cheaper than alternative ways of boosting school participation. Deworming substantially improved health and school participation among untreated children in both treatment schools and neighboring schools, and these externalities are large enough to justify fully subsidizing treatment. Yet we do not find evidence that deworming improved academic test scores.
Article
“Site selection bias” can occur when the probability that a program is adopted or evaluated is correlated with its impacts. I test for site selection bias in the context of the Opower energy conservation programs, using 111 randomized control trials involving 8.6 million households across the United States. Predictions based on rich microdata from the first 10 replications substantially overstate efficacy in the next 101 sites. Several mechanisms caused this positive selection. For example, utilities in more environmentalist areas are more likely to adopt the program, and their customers are more responsive to the treatment. Also, because utilities initially target treatment at higher-usage consumer subpopulations, efficacy drops as the program is later expanded. The results illustrate how program evaluations can still give systematically biased out-of-sample predictions, even after many replications. JEL Codes: C93, D12, L94, O12, Q41.
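The selection mechanism documented here is easy to mimic in a toy model of our own: if high-impact sites adopt first, evaluations in those sites overstate the program's average impact elsewhere. All numbers below are invented.

```python
import numpy as np

# Toy site-selection-bias model: sites whose local impact is largest
# adopt (and get evaluated) first, so early evaluations give biased
# out-of-sample predictions even if each RCT is internally valid.

rng = np.random.default_rng(5)
site_effects = rng.normal(0.02, 0.01, 1_000)     # true per-site impacts

# Early adopters: sites whose impact is in the top decile.
early = site_effects[site_effects > np.quantile(site_effects, 0.9)]

print(f"mean impact, first-adopter sites: {early.mean():.3f}")
print(f"mean impact, all sites:           {site_effects.mean():.3f}")
```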
Article
We present experimental evidence on the impact of a school choice program in the Indian state of Andhra Pradesh (AP) that provided students with a voucher to finance attending a private school of their choice. The study design featured a unique two-stage lottery-based allocation of vouchers that created both a student-level and a market-level experiment, which allows us to study both the individual and the aggregate effects of school choice (including spillovers). After two and four years of the program, we find no difference between test scores of lottery winners and losers on Telugu (native language), math, English, and science/social studies, suggesting that the large cross-sectional differences in test scores across public and private schools mostly reflect omitted variables. However, private schools also teach Hindi, which is not taught by the public schools, and lottery winners have much higher test scores in Hindi. Further, the mean cost per student in the private schools in our sample was less than one-third of the cost in public schools. Thus, private schools in this setting deliver slightly better test score gains than their public counterparts (better on Hindi and same in other subjects), and do so at a substantially lower cost per student. Finally, we find no evidence of spillovers on public-school students who do not apply for the voucher, or on private school students, suggesting that the positive impacts on voucher winners did not come at the expense of other students.
Article
Scaling of evidence-based practices in education has received extensive discussion but little empirical evaluation. We present here a descriptive summary of the experience from seven states with a history of implementing and scaling School-Wide Positive Behavioral Interventions and Supports (SWPBIS) over the past decade. Each state has been successful in establishing at least 500 schools using SWPBIS across approximately a third or more of the schools in their state. The implementation elements proposed by Sugai, Horner, and Lewis (2009) and the stages of implementation described by Fixsen, Naoom, Blase, Friedman, and Wallace (2005) were used within a survey with each element assessed at each stage by the SWPBIS coordinators and policy makers in the seven states. Consistent themes from analysis of the responses were defined and confirmed with the surveyed participants. Results point to four central areas of state "capacity" as being perceived as critical for a state to move SWPBIS to scale (administrative leadership and funding, local training and coaching capacity, behavioral expertise, and local evaluation capacity), and an iterative process in which initial implementation success (100-200 demonstrations) is needed to recruit the political and fiscal support required for larger scaling efforts.
Article
This paper provides a method to infer the presence of treatment spillovers within markets where a fraction of agents is treated. We model individual outcomes as functions of the assigned treatment status and the distribution of assigned treatments in a market. We develop a two-step identification and estimation method, focusing first on the treatment distribution among individuals within markets and then on the treatment distribution across markets. We apply our approach to training programs for unemployed individuals in France using rich administrative data. Our results provide evidence of interactions within local labor markets as potential individual outcomes vary with the proportion of treated individuals.
Article
Causal evidence on microcredit impacts informs theory, practice, and debates about its effectiveness as a development tool. The six randomized evaluations in this volume use a variety of sampling, data collection, experimental design, and econometric strategies to identify causal effects of expanded access to microcredit on borrowers and/or communities. These methods are deployed across an impressive range of locations (six countries on four continents, urban and rural areas), borrower characteristics, loan characteristics, and lender characteristics. Summarizing and interpreting results across studies, we note a consistent pattern of modestly positive, but not transformative, effects. We also discuss directions for future research.