Larry V Hedges, PhD
Northwestern University (NU) · Department of Statistics
About
Publications: 348 · Reads: 240,825
Citations: 110,650
Introduction
I am an applied statistician working primarily in education and the social sciences. My work concerns meta-analysis, design of evaluation studies, and statistical aspects of replication in science.
Additional affiliations
September 1980 - September 2005
September 2005 - present · Northwestern University
Publications (348)
The standardized mean difference (sometimes called Cohen’s d) is an effect size measure widely used to describe the outcomes of experiments. It is a mathematically natural way to describe differences between groups of data that are normally distributed with different means but the same standard deviation. In that context, it can be interpreted as determi...
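The computation this abstract alludes to is the usual pooled-SD standardization. A minimal sketch in Python, not taken from the paper; the function and variable names are illustrative:

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference with a pooled standard deviation.

    Assumes two (approximately) normal groups with a common SD,
    which is the setting the abstract describes.
    """
    # Pooled standard deviation across the two groups
    sd_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / sd_pooled

# Example: treatment mean 105 (SD 15, n 40) vs. control mean 100 (SD 15, n 40)
print(cohens_d(105, 15, 40, 100, 15, 40))  # about 0.33
```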
Well-chosen covariates boost the design sensitivity of individually and cluster-randomized trials. We provide guidance on covariate selection by generating an extensive compilation of single- and multilevel design parameters on student achievement. Embedded in psychometric heuristics, we analyzed (a) covariate types of varying bandwidth-fidelity, name...
Single case experimental designs are an important research design in behavioral and medical research. Although there are design standards prescribed by the What Works Clearinghouse for single case experimental designs, these standards do not include statistically derived power computations. Recently we derived the equations for computing power for...
Conventional random‐effects models in meta‐analysis rely on large sample approximations instead of exact small sample results. While random‐effects methods produce efficient estimates, and confidence intervals for the summary effect have correct coverage when the number of studies is sufficiently large, we demonstrate that conventional methods resul...
N-of-1 trials, a special case of Single Case Experimental Designs (SCEDs), are prominent in clinical medical research and specifically psychiatry due to the growing significance of precision/personalized medicine. It is imperative that these clinical trials be conducted, and their data analyzed, using the highest standards to guard against threats...
This chapter examines the literature on interventions in physics education through the lens of optimizing and accelerating knowledge accumulation. Specifically, intervention research in physics education is discussed in terms of the prevalence of randomized designs and meta-analyses of effects from similar interventions. The authors mak...
It is common practice in both randomized and quasi‐experiments to adjust for baseline characteristics when estimating the average effect of an intervention. The inclusion of a pre‐test, for example, can reduce both the standard error of this estimate and—in non‐randomized designs—its bias. At the same time, it is also standard to report the effect...
Currently, the design standards for single-case experimental designs (SCEDs) are based on validity considerations as prescribed by the What Works Clearinghouse. However, there is a need for design considerations such as power based on statistical analyses. We derive and compute power for (AB)^k designs with multiple cases, which ar...
Multisite field experiments using the (generalized) randomized block design that assign treatments to individuals within sites are common in education and the social sciences. Under this design, there are two possible estimands of interest and they differ based on whether sites or blocks have fixed or random effects. When the average treatment effe...
Descriptive analyses of socially important or theoretically interesting phenomena and trends are a vital component of research in the behavioral, social, economic, and health sciences. Such analyses yield reliable results when using representative individual participant data (IPD) from studies with complex survey designs, including educational larg...
The education landscape in the United States has been changing rapidly in recent decades: student populations have become more diverse; there has been an explosion of data sources; there is an intensified focus on diversity, equity, inclusion, and accessibility; educators and policy makers at all levels want more and better data for evidence-based...
The rise of multi-site, field-based trials in early childhood research coupled with advances in statistics offer an unprecedented opportunity to understand how context affects children's wellbeing. In the current study, we chart our journey in exploring heterogeneity in the treatment effects of an existing large-scale evaluation to provide guidance...
Descriptive analyses of educational phenomena are a vital component of educational research. Such analyses yield reliable results when using representative individual participant data (IPD) from educational large-scale assessments (ELSAs). The meta-analytic integration of these results offers unique and novel research opportunities to provide stron...
To determine “what works, for whom, and under what conditions,” interventions need to be studied in diverse and heterogeneous samples. At an international scope, this degree of heterogeneity is unlikely in a single study and instead requires conducting multiple studies of the same intervention across the globe. In this paper, we provide an overview...
This chapter provides practical advice about how to think about heterogeneity. It highlights the prediction interval, the statistic that reports the range of true effects. This statistic provides the information that we need, and that many think is being provided by the other statistics. The forest plot of a meta‐analysis typically includes a line...
This chapter introduces the fixed‐effect model. It discusses the assumptions of this model and shows how these are reflected in the formulas used to compute a summary effect, and in the meaning of the summary effect. All factors that could influence the effect size are the same in all the studies, and therefore the true effect size is the same (hen...
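Under the fixed-effect model described here, the summary effect is the inverse-variance weighted mean of the study estimates. A minimal sketch of that standard computation, using illustrative names rather than code from the chapter:

```python
import math

def fixed_effect_summary(effects, variances):
    """Inverse-variance weighted summary effect under the fixed-effect model."""
    weights = [1.0 / v for v in variances]
    m = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    var_m = 1.0 / sum(weights)               # variance of the summary effect
    se_m = math.sqrt(var_m)
    ci = (m - 1.96 * se_m, m + 1.96 * se_m)  # 95% CI based on the normal distribution
    return m, se_m, ci

print(fixed_effect_summary([0.3, 0.5, 0.4], [0.04, 0.02, 0.05]))
```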
The basic idea of meta‐analysis is to compute an effect size from each of several studies, and to calculate a weighted average of these effect size estimates. This chapter provides some examples of situations in which requirements for meta‐analysis are met and where meta‐analysis can therefore be used to combine findings across studies. It aims to...
This chapter provides information on various websites, professional societies, and journals on meta‐analysis, as well as special issues dedicated to meta‐analysis and books on systematic review methods and meta‐analysis. The Human Genome Epidemiology Network is a global collaboration committed to the assessment of the impact of human genome variati...
When the studies report means and standard deviations, the preferred effect size is usually the raw mean difference, the standardized mean difference, or the response ratio. These effect sizes are discussed in this chapter. When the outcome is reported on a meaningful scale and all studies in the analysis use the same scale, the meta‐analysis can b...
In this chapter, the authors show how they can use a prediction interval to describe the distribution of true effect sizes. They review how the prediction interval is used in primary studies, and also show how the same mechanism can be used for meta‐analysis. The summary line in a forest plot uses a diamond to depict the mean effect size and its co...
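The prediction interval the chapter describes combines the uncertainty in the mean with the between-study variance. A hedged sketch of the commonly used t-based formula with k - 2 degrees of freedom (scipy is assumed only for the t quantile; the numbers are made up):

```python
import math
from scipy import stats

def prediction_interval(mean_effect, se_mean, tau2, k, level=0.95):
    """Approximate prediction interval for the true effect in a new study.

    k is the number of studies and tau2 the between-study variance.
    """
    t_crit = stats.t.ppf(1 - (1 - level) / 2, df=k - 2)
    half_width = t_crit * math.sqrt(tau2 + se_mean**2)
    return mean_effect - half_width, mean_effect + half_width

print(prediction_interval(mean_effect=0.45, se_mean=0.08, tau2=0.04, k=12))
```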
Most of the issues that one would address when reporting the results of a meta‐analysis are similar to those for reporting the results of a primary study. There are some unique issues as well, and this chapter addresses those issues. A common mistake is to use the fixed‐effect model on the basis that there is no evidence of heterogeneity. The fores...
A meta‐analysis of effect sizes addresses the magnitude of the effect. Vote counting is the process of counting the number of studies that are statistically significant and the number that are not, and then choosing the winner. A meta‐analysis of p‐values tells us only that the effect is probably not zero. This chapter describes two methods for per...
For studies that report a correlation between two continuous variables, the correlation coefficient itself can serve as the effect size index. The correlation is an intuitive measure that has been standardized to take account of different metrics in the original scales. Most meta‐analysts do not perform syntheses on the correlation coefficient itse...
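The practice the chapter refers to is converting correlations to Fisher's z for the analysis and back-transforming the summary to the correlation metric. A sketch under that convention (names are illustrative):

```python
import math

def fisher_z(r):
    """Fisher's z transformation of a correlation."""
    return 0.5 * math.log((1 + r) / (1 - r))

def fisher_z_variance(n):
    """Approximate sampling variance of Fisher's z for sample size n."""
    return 1.0 / (n - 3)

def z_to_r(z):
    """Back-transform a summary z to the correlation metric."""
    return math.tanh(z)

z = fisher_z(0.40)
print(z, fisher_z_variance(50), z_to_r(z))
```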
Vote counting is the process of counting the number of studies that are statistically significant and comparing this with the number that are not statistically significant. In any event, the idea of vote counting is fundamentally flawed and the variants on this process are equally flawed. This chapter aims to explain why this is so, and to provide...
This chapter is adapted from the text Common Mistakes in Meta‐Analysis and How to Avoid Them. When the analysis is based on studies pulled from the literature, the random‐effects model is almost invariably the model that should be used. The random‐effects model works well if the following assumptions are met: the studies that were performed are a r...
In meta‐analysis, the confidence interval for the mean is traditionally based on the Z distribution, which yields a relatively narrow interval. When researchers use the random effects model, it would be better to use the Knapp–Hartung adjustment, which yields a wider confidence interval. The adjustment includes two components. First, it modifies th...
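The adjustment mentioned here rescales the variance of the summary effect and replaces the Z critical value with a t value on k - 1 degrees of freedom. A minimal sketch of that standard computation, assuming random-effects weights and an estimate of tau^2 are already available (scipy is used only for the t quantile):

```python
import math
from scipy import stats

def knapp_hartung_ci(effects, variances, tau2, level=0.95):
    """Random-effects mean with a Knapp-Hartung (t-based) confidence interval."""
    k = len(effects)
    weights = [1.0 / (v + tau2) for v in variances]
    m = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Knapp-Hartung variance estimator for the summary effect
    var_kh = sum(w * (y - m) ** 2 for w, y in zip(weights, effects)) / ((k - 1) * sum(weights))
    t_crit = stats.t.ppf(1 - (1 - level) / 2, df=k - 1)
    half = t_crit * math.sqrt(var_kh)
    return m, (m - half, m + half)

print(knapp_hartung_ci([0.2, 0.5, 0.35, 0.6], [0.04, 0.03, 0.05, 0.02], tau2=0.02))
```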
This chapter presents worked examples for exploring how to compute the measures of heterogeneity. It shows how to compute the effect size (the log odds ratio) and variance for each study. Further, the chapter also shows how to compute the effect size (the Fisher’s z transformation of the correlation coefficient) and variance for each study. It incl...
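For the first worked example, the effect size is the log odds ratio with its usual variance from the four cells of a 2 x 2 table. A sketch of those two computations; the 0.5 continuity correction shown for zero cells is a common convention, stated here as an assumption rather than the chapter's choice:

```python
import math

def log_odds_ratio(a, b, c, d, correction=0.5):
    """Log odds ratio and its variance from a 2x2 table.

    a, b = events / non-events in the treated group
    c, d = events / non-events in the control group
    A 0.5 correction is added to every cell if any cell is zero.
    """
    if 0 in (a, b, c, d):
        a, b, c, d = (x + correction for x in (a, b, c, d))
    log_or = math.log((a * d) / (b * c))
    var = 1 / a + 1 / b + 1 / c + 1 / d
    return log_or, var

print(log_odds_ratio(12, 88, 20, 80))
```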
The first case of a complex data structure is the case where studies report data from two or more independent subgroups. The stage‐1 and stage‐2 patients represent two independent subgroups since each patient is included in one group or the other, but not both. This chapter aims to compute a summary effect for the impact of the intervention for sta...
For data from a prospective study, such as a randomized trial, that was originally reported as the number of events and non‐events in two groups (the classic 2 × 2 table), researchers typically compute a risk ratio, an odds ratio, and/or a risk difference. For risk ratios, computations are carried out on a log scale. The log risk ratio and the stan...
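The log risk ratio and its standard error follow the same pattern on the log scale. A minimal sketch of the standard formulas (counts are illustrative):

```python
import math

def log_risk_ratio(events_t, n_t, events_c, n_c):
    """Log risk ratio and its standard error from counts in two groups."""
    log_rr = math.log((events_t / n_t) / (events_c / n_c))
    # Standard delta-method variance of the log risk ratio
    var = 1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c
    return log_rr, math.sqrt(var)

print(log_risk_ratio(12, 100, 20, 100))
```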
This chapter provides examples of how one might explain the results of a simple meta‐analysis, for example to a colleague. There is one example based on each of several effect sizes. The chapter introduces the analysis by providing some basic information such as the number of studies and the effect‐size index. It also provides the rationale for usi...
The report of a meta‐analysis will focus on the mean effect size, and then address heterogeneity as a separate matter, if at all. Castells et al. conducted a meta‐analysis of studies that assessed the impact of methylphenidate vs. placebo on the cognitive functioning of adults with attention deficit hyperactivity disorder. Katout et al. looked at t...
This chapter presents an overview of the key concepts discussed in part 7 of this book. The part discusses three cases where studies provide more than one unit of data for the analysis. These are the case of multiple independent subgroups within a study, multiple outcomes or time‐points based on the same subjects, and two or more treatment groups t...
A cumulative meta‐analysis is a meta‐analysis that is performed first with one study, then with two studies, and so on, until all relevant studies have been included in the analysis. Lau et al. used the streptokinase analysis to show the potential impact of meta‐analysis as part of the research process. They argued that if meta‐analysis had been av...
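A cumulative meta-analysis is simply the same summary recomputed as each study is added, usually in chronological order. A sketch using a fixed-effect summary for simplicity (the helper name and data are illustrative):

```python
def cumulative_meta_analysis(effects, variances):
    """Re-estimate the summary effect after each successive study is added."""
    results = []
    for k in range(1, len(effects) + 1):
        w = [1.0 / v for v in variances[:k]]
        m = sum(wi * yi for wi, yi in zip(w, effects[:k])) / sum(w)
        results.append((k, m))
    return results

# Studies ordered by publication year
for k, m in cumulative_meta_analysis([0.6, 0.4, 0.5, 0.45], [0.09, 0.04, 0.05, 0.02]):
    print(f"after {k} studies: summary = {m:.3f}")
```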
This chapter begins with an example to show how meta‐analysis and narrative review would approach the same question, and then uses this example to highlight the key differences between the two. The meta‐analysis allows us to combine the effects and evaluate the statistical significance of the summary effect. The meta‐analytic approaches allow us to...
In this chapter, the authors address a number of issues that are relevant to both subgroup analyses and to meta‐regression. The researcher must always choose between a fixed‐effect model and a random‐effects model. Researchers often ask about the practical implications of using a random‐effects model rather than a fixed‐effect model. Since the mean...
A central theme in this volume is the fact that we usually prefer to work with effect sizes, rather than p‐values. The reason reflects a fundamental issue that applies both to primary studies and to meta‐analysis, and is the subject of this chapter. Since narrative reviews typically work with p‐values while meta‐analyses typically work with effect...
The goal of a meta‐analysis is only rarely to synthesize data from a set of identical studies. Almost invariably, the goal is to broaden the base of studies in some way, expand the question, and study the pattern of answers. The question of whether it makes sense to perform a meta‐analysis, and the question of what kinds of studies to include, must...
This chapter provides an overview of software Comprehensive Meta‐Analysis (CMA) and shows how to use it to implement the ideas. The same approach could be used with any other program as well. The chapter also provides a sense for the look‐and‐feel of the program. CMA features a spreadsheet view and a menu‐driven interface. As such, it allows a rese...
In this chapter, the authors show how meta‐analysis can be used to compare the mean effect for different subgroups of studies. They present three computational models. These are fixed‐effect, random‐effects using separate estimates of τ², and random‐effects using a pooled estimate of τ². In a primary study, the t‐test can be used to compare the mea...
The effect size, a value which reflects the magnitude of the treatment effect or the strength of a relationship between two variables, is the unit of currency in a meta‐analysis. In this chapter, the effect size for each study is computed, and then the effect sizes are discussed to assess the consistency of the effect across studies and to compute a...
This chapter shows how the multiple regression used in primary studies can be applied to meta‐regression. It begins with the fixed‐effect model, which is simpler, and then moves on to the random‐effects model, which is generally more appropriate. Since the meaning of a summary effect size is different for fixed versus random effects, the null hypot...
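Fixed-effect meta-regression is weighted least squares with inverse-variance weights; the random-effects version simply adds tau^2 to each study's variance before weighting. A sketch with numpy, illustrative rather than the chapter's own code:

```python
import numpy as np

def meta_regression(effects, variances, covariate, tau2=0.0):
    """Weighted least-squares meta-regression of effect sizes on one covariate.

    tau2 = 0 gives the fixed-effect model; a positive tau2 gives
    random-effects weighting.
    """
    y = np.asarray(effects, dtype=float)
    X = np.column_stack([np.ones_like(y), np.asarray(covariate, dtype=float)])
    w = 1.0 / (np.asarray(variances, dtype=float) + tau2)
    W = np.diag(w)
    # Weighted least-squares estimates of intercept and slope
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    cov_beta = np.linalg.inv(X.T @ W @ X)  # covariance matrix of the estimates
    return beta, np.sqrt(np.diag(cov_beta))

beta, se = meta_regression([0.2, 0.4, 0.5, 0.7], [0.04, 0.03, 0.05, 0.02], [10, 20, 30, 40])
print(beta, se)
```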
This chapter focuses on two themes related to statistical power. The first theme is conceptual. The chapter discusses the factors that determine power and explores how the value of these factors may change as we move from a primary study to a meta‐analysis. The second theme is practical. The chapter briefly reviews the process of power analysis for...
Studies that used independent groups and studies that used matched groups were both used to yield estimates of the standardized mean difference. There is no problem in combining these estimates in a meta‐analysis since the effect size has the same meaning in all studies. The question of whether or not it is appropriate to combine effect sizes from...
This chapter aims to compute a summary effect for the intervention on Basic skills, which combines the data from reading and math. It investigates the difference in effect size for reading versus math, and explains the method used to compute this effect size and its variance. Since every study will be represented by one score in the meta‐analysis r...
This chapter discusses the reasons for publication bias and the evidence that it exists. It also outlines a series of methods that have been developed to assess the likely impact of bias in any given meta‐analysis. The chapter introduces the idea of a small‐study effect, and how this is often conflated with publication bias. In particular, it expla...
This chapter provides some context for the variance for specific effect sizes such as the standardized mean difference or a log risk ratio. The term precision is used as a general term to encompass three formal statistics, the variance, standard error, and confidence interval. The chapter outlines the relationship between the indices of precision....
Under the random‐effects model, the true effect size may vary from study to study. This chapter discusses approaches to identify and then quantify this heterogeneity. It describes the mechanism that is used to extract the true between‐studies variation from the observed variation. The chapter considers what is meant by ‘heterogeneous’ and then resp...
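The mechanism mentioned here is the standard decomposition of observed variation into sampling error plus true between-study variation: the Q statistic, an estimate of tau^2, and I². A minimal sketch using the DerSimonian-Laird (method-of-moments) estimator, which is one common choice among several:

```python
def heterogeneity(effects, variances):
    """Q, DerSimonian-Laird tau^2, and I^2 from effect sizes and their variances."""
    k = len(effects)
    w = [1.0 / v for v in variances]
    m = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - m) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                 # truncated at zero
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0
    return q, tau2, i2

print(heterogeneity([0.1, 0.5, 0.35, 0.8], [0.04, 0.03, 0.05, 0.02]))
```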
This chapter provides information on software used for meta‐analysis. The software Comprehensive Meta‐Analysis (CMA) was initially released in 2000 and has been updated on a regular basis since then. The next version is scheduled for release in 2021. For researchers who would prefer to use R to perform meta‐analysis, Wolfgang Viechtbauer has publis...
This chapter presents worked examples for continuous data (using the standardized mean difference), binary data (using the odds ratio) and correlational data (using the Fisher’s z transformation). It starts with the mean, standard deviation, and sample size, and uses the bias‐corrected standardized mean difference (Hedges’ g) as the effect size mea...
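The bias-corrected standardized mean difference (Hedges' g) applies a small-sample correction factor J to d. A sketch using the common approximation J = 1 - 3/(4·df - 1) and one common approximation for the variance; not the chapter's worked numbers:

```python
def hedges_g(d, n1, n2):
    """Small-sample bias-corrected standardized mean difference (Hedges' g)."""
    df = n1 + n2 - 2
    j = 1 - 3 / (4 * df - 1)        # approximate correction factor
    g = j * d
    # One common approximation to the variance of g
    var_g = j**2 * ((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return g, var_g

print(hedges_g(0.50, 12, 12))
```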
This chapter addresses various criticisms that have been leveled at meta‐analysis. They are one number cannot summarize a research field, the file drawer problem invalidates meta‐analysis, mixing apples and oranges, garbage in, garbage out, important studies are ignored, meta‐analysis can disagree with randomized trials, and meta‐analyses are perfo...
This chapter addresses how to proceed when we want to incorporate treatment groups in the same analysis. Specifically, it aims to compute a summary effect for the active intervention versus control and to investigate the difference in effect size for interventions. The chapter describes the difference between multiple outcomes and multiple com...
This chapter presents two methods, the Mantel–Haenszel method and the one‐step method (also known as the Peto method) for performing a meta‐analysis on odds ratios. For both methods we assume the data from each study are presented in the form of a 2 × 2 table. The Mantel–Haenszel method is based on the fixed‐effect model, where the weight assigned...
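The Mantel–Haenszel summary odds ratio can be written as the ratio of the sums of a·d/n and b·c/n across studies, which is equivalent to weighting each study's odds ratio by b·c/n. A minimal sketch of the point estimate only (the variance formula, which is more involved, is omitted; the tables are made up):

```python
def mantel_haenszel_or(tables):
    """Mantel-Haenszel pooled odds ratio from a list of 2x2 tables (a, b, c, d)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den

tables = [(12, 88, 20, 80), (8, 92, 15, 85), (20, 180, 30, 170)]
print(mantel_haenszel_or(tables))
```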
This chapter highlights the conceptual and practical differences between fixed‐effect and random‐effects models. Under the random‐effects model the goal is not to estimate one true effect, but to estimate the mean of a distribution of effects. Under the fixed‐effect model there is a wide range of weights whereas under the random‐effects model the w...
Researchers have developed the practice of classifying heterogeneity as being low, moderate, or high based on the value of I2. This chapter argues that the idea of classifying heterogeneity based on I2 should be strongly discouraged. In the transfusion analysis, the I2 statistic was 29%. In the off‐hours analysis, the I2 statistic was 75%. On that...
This chapter introduces the random‐effects model. It discusses the assumptions of this model, and show how these are reflected in the formulas used to compute a summary effect, and in the meaning of the summary effect. The fixed‐effect model starts with the assumption that the true effect size is the same in all studies. In a random‐effects meta‐an...
To compute the summary effect in a meta‐analysis the researchers compute an effect size for each study and then combine these effect sizes, rather than pooling the data directly. Van Howe published a review article in the International Journal of STD and AIDS that looked at the relationship between circumcision and HIV infection in Africa. The arti...
This chapter provides an overview of two issues. One is the approach to estimates of effect (known as artifact correction), which will be of interest to nearly anyone thinking about using meta‐analysis. The other is the methods that are commonly used to combine results in the field of psychometric meta‐analysis, which will be of interest primarily...
The #1 resource for carrying out educational research as part of postgraduate study.
High-quality educational research requires careful consideration of every aspect of the process. This all-encompassing textbook written by leading international experts gives students and early career researchers a considered overview of principles that underpin r...
Empirical evaluations of replication have become increasingly common, but there has been no unified approach to doing so. Some evaluations conduct only a single replication study while others run several, usually across multiple laboratories. Designing such programs has largely contended with difficult issues about which experimental components are...
Although statistical practices to evaluate intervention effects in single-case experimental design (SCEDs) have gained prominence in recent times, models are yet to incorporate and investigate all their analytic complexities. Most of these statistical models incorporate slopes and autocorrelations, both of which contribute to trend in the data. The...
States often turn to a data masking procedure called microsuppression in order to reduce the risk of disclosing student records when sharing data with external researchers. This process removes records deemed to have high risk for disclosure should data be released. However, this process can induce differences between the original data and the data...
Meta-analysis has been used to examine the effectiveness of childhood obesity prevention efforts, yet conventional meta-analytic methods restrict the kinds of studies included, and either narrowly define mechanisms and agents of change, or examine the effectiveness of whole interventions as opposed to the specific actions that comprise...
Objective:
To evaluate the efficacy of childhood obesity interventions and conduct a taxonomy of intervention components that are most effective in changing obesity-related health outcomes in children 2-5 years of age.
Methods:
Comprehensive searches located 51 studies from 18,335 unique records. Eligible studies: (1) assessed children aged 2-5, l...
Introduction:
There is a great need for analytic techniques that allow for the synthesis of learning across seemingly idiosyncratic interventions.
Objectives:
The primary objective of this paper is to introduce taxonomic meta-analysis and explain how it is different from conventional meta-analysis.
Results:
Conventional meta-analysis has previous...
Many experimental designs in educational and behavioral research involve at least one level of clustering. Clustering affects the precision of estimators and its impact on statistics in cross-sectional studies is well known. Clustering also occurs in longitudinal designs where students that are initially grouped may be regrouped in the following ye...
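The impact of clustering on precision in a cross-sectional design is usually summarized by the design effect 1 + (n - 1)ρ, where n is the cluster size and ρ the intraclass correlation. A sketch of how it inflates variance and deflates the effective sample size; this is the standard cross-sectional result, not the paper's longitudinal extension:

```python
def design_effect(cluster_size, icc):
    """Variance inflation from clustering: 1 + (n - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc

def effective_sample_size(n_total, cluster_size, icc):
    """Sample size after deflating for the design effect."""
    return n_total / design_effect(cluster_size, icc)

print(design_effect(25, 0.20))              # 5.8
print(effective_sample_size(1000, 25, 0.20))
```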
Recent empirical evaluations of replication in psychology have reported startlingly few successful replication attempts. At the same time, these programs have noted that the proper way to analyze replication studies is far from a settled matter and have analyzed their data in several different ways. This presents two challenges to interpreting the...
In this study, we reanalyze recent empirical research on replication from a meta-analytic perspective. We argue that there are different ways to define "replication failure," and that analyses can focus on exploring variation among replication studies or assess whether their results contradict the findings of the original study. We apply this frame...
Immediacy is one of the necessary criteria to show strong evidence of treatment effect in single-case experimental designs (SCEDs). However, with the exception of Natesan and Hedges (2017), no inferential statistical tool has been used to demonstrate or quantify it until now. We investigate and quantify immediacy by treating the change points betwe...
In this rejoinder, we discuss Mathur and VanderWeele's response to our article, "Statistical Analyses for Studying Replication: Meta-Analytic Perspectives," which appears in this current issue. We attempt to clarify a point of confusion regarding the inclusion of an original study in an analysis of replication, and the potential impact of publicati...
The problem of assessing whether experimental results can be replicated is becoming increasingly important in many areas of science. It is often assumed that assessing replication is straightforward: All one needs to do is repeat the study and see whether the results of the original and replication studies agree. This article shows that the statist...
The Center for Open Science (COS) will create an ECR Data Resource Hub to facilitate rigorous and reproducible research practices such as data sharing and study registration. The Hub will integrate training materials, infrastructure, community engagement, and innovation in research to advance rigorous research skills and behavior across the STEM ed...
Systematic reviews are characterized by a methodical and replicable methodology and presentation. They involve a comprehensive search to locate all relevant published and unpublished work on a subject; a systematic integration of search results; and a critique of the extent, nature, and quality of evidence in relation to a particular research quest...
The concept of replication is fundamental to the logic and rhetoric of science, including the argument that science is self-correcting. Yet there is very little literature on the methodology of replication. In this article, I argue that the definition of replication should not require underlying effects to be identical, but should permit some varia...
Formal empirical assessments of replication have recently become more prominent in several areas of science, including psychology. These assessments have used different statistical approaches to determine if a finding has been replicated. The purpose of this article is to provide several alternative conceptual frameworks that lead to different stat...
BACKGROUND: The CONSORT (Consolidated Standards of Reporting Trials) Statement was developed to help biomedical researchers report randomised controlled trials (RCTs) transparently. We have developed an extension to the CONSORT 2010 Statement for social and psychological interventions (CONSORT-SPI 2018) to help behavioural and social scientists rep...
BACKGROUND: Randomised controlled trials (RCTs) are used to evaluate social and psychological interventions and inform policy decisions about them. Accurate, complete, and transparent reports of social and psychological intervention RCTs are essential for understanding their design, conduct, results, and the implications of the findings. However, t...
Background and purpose: Studies of education and learning that were described as experiments have been carried out in the USA by educational psychologists since about 1900. In this paper, we discuss the history of randomised trials in education in the USA in terms of five historical periods. In each period, the use of randomised trials was motivate...
Equation (26) is formatted incorrectly in the pdf version. It should appear as follows.
The scientific rigor of education research has improved dramatically since the year 2000. Much of the credit for this improvement is deserved by IES policies that helped create a demand for rigorous research, increased human capital capacity to carry out such work, provided funding for the work itself, and collected, evaluated, and made available...
We propose to change the default P-value threshold for statistical significance for claims of new discoveries from 0.05 to 0.005.
Objective/study question:
To estimate and compare sample average treatment effects (SATE) and population average treatment effects (PATE) of a resident duty hour policy change on patient and resident outcomes using data from the Flexibility in Duty Hour Requirements for Surgical Trainees Trial ("FIRST Trial").
Data sources/study setting:
Seconda...
"We propose to change the default P-value threshold forstatistical significance for claims of new discoveries from 0.05 to 0.005."
Although immediacy is one of the necessary criteria to show strong evidence of a causal relation in SCDs, no inferential statistical tool is currently used to demonstrate it. We propose a Bayesian unknown change-point model to investigate and quantify immediacy in SCD analysis. Unlike visual analysis that considers only 3-5 observations in consecut...
I discuss how methods that adjust for publication selection involve implicit or explicit selection models. Such models describe the relation between the studies conducted and those actually observed. I argue that the evaluation of selection models should include an evaluation of the plausibility of the empirical implications of that model. This inc...
A task force of experts was convened by the American Psychological Association (APA) to update the knowledge and policy about the impact of violent video game use on potential adverse outcomes. This APA Task Force on Media Violence examined the existing literature, including the meta-analyses in the field, since the last APA report on media violenc...
When we speak about heterogeneity in a meta-analysis, our intent is usually to understand the substantive implications of the heterogeneity. If an intervention yields a mean effect size of 50 points, we want to know if the effect size in different populations varies from 40 to 60, or from 10 to 90, because this speaks to the potential utility of th...