# Eric-Jan Wagenmakers

University of Amsterdam | UvA · Department of Psychological Methods

## About

- Publications: 439
- Reads: 153,441 (a 'read' is counted each time someone views a publication summary, such as the title, abstract, and list of authors, clicks on a figure, or views or downloads the full text)
- Citations: 28,689

## Publications

The ongoing replication crisis in science has increased interest in the methodology of replication studies. We propose a novel Bayesian analysis approach using power priors: The likelihood of the original study's data is raised to the power of $\alpha$, and then used as the prior distribution in the analysis of the replication data. Posterior distr...
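
The construction can be written compactly; here $D_0$ denotes the original data, $D_r$ the replication data, and $\pi_0(\theta)$ an initial prior (notation assumed for this sketch):

```latex
\pi(\theta \mid D_0, \alpha) \;\propto\; L(\theta \mid D_0)^{\alpha}\,\pi_0(\theta),
\qquad 0 \le \alpha \le 1,
```

so that the replication analysis yields $p(\theta \mid D_r, D_0, \alpha) \propto L(\theta \mid D_r)\,L(\theta \mid D_0)^{\alpha}\,\pi_0(\theta)$; $\alpha = 0$ discards the original study entirely, while $\alpha = 1$ pools both data sets.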

A fundamental part of experimental design is to determine the sample size of a study. However, sparse information about population parameters and effect sizes before data collection renders effective sample size planning challenging. Specifically, sparse information may lead research designs to be based on inaccurate a-priori assumptions, causing s...

Theoretical arguments and empirical investigations indicate that a high proportion of published findings are false or do not replicate. The current position paper provides a broad perspective on this scientific error, focusing both on reform history and on opportunities for future reform. Talking points are organised along four main themes: methodo...

Publication bias is a ubiquitous threat to the validity of meta‐analysis and the accumulation of scientific evidence. In order to estimate and counteract the impact of publication bias, multiple methods have been developed; however, recent simulation studies have shown the methods' performance to depend on the true data generating process, and no m...

We present a novel and easy to use method for calibrating error-rate based confidence intervals to evidence-based support intervals. Support intervals are obtained from inverting Bayes factors based on the point estimate and standard error of a parameter estimate. A $k$ support interval can be interpreted as "the interval contains parameter values...

A perennial objection against Bayes factor point-null hypothesis tests is that the point-null hypothesis is known to be false from the outset. We examine the consequences of approximating the sharp point-null hypothesis by a hazy ‘peri-null’ hypothesis instantiated as a narrow prior distribution centered on the point of interest. The peri-null Baye...

Uncertainty is ubiquitous in science, but scientific knowledge is often represented to the public and in educational contexts as certain and immutable. This contrast can foster distrust when scientific knowledge develops in a way that people perceive as reversals, as we have observed during the ongoing COVID-19 pandemic. Drawing on research in st...

In this document, we outline the dataset that was used in the Many-Analysts Religion Project (MARP). Specifically, we provide details on how participants were recruited and what materials were used. The dataset itself is openly available at https://osf.io/k9puq/. If you want to use the data, please cite this document.

Null hypothesis statistical significance testing (NHST) is the dominant approach for evaluating results from randomized controlled trials. Whereas NHST comes with long-run error rate guarantees, its main inferential tool -- the $p$-value -- is only an indirect measure of evidence against the null hypothesis. The main reason is that the $p$-value is...

Power priors are used for incorporating historical data in Bayesian analyses by taking the likelihood of the historical data raised to the power $\alpha$ as the prior distribution for the model parameters. The power parameter $\alpha$ is typically unknown and assigned a prior distribution, most commonly a beta distribution. Here, we give a novel th...
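
A minimal conjugate sketch of the power-prior update, with a binomial likelihood and a fixed $\alpha$ for brevity (the paper treats $\alpha$ as unknown with a beta prior); all numbers below are hypothetical:

```python
# Hypothetical numbers: historical study with k0 successes out of n0 trials,
# current study with k successes out of n trials, discounting power alpha.
a0, b0 = 1.0, 1.0          # initial Beta(1, 1) prior
k0, n0 = 30, 100           # historical data
k, n = 24, 100             # current data
alpha = 0.5                # fixed power parameter (0 = discard, 1 = pool)

# Power prior: raising the binomial likelihood of the historical data to
# alpha stays conjugate to the beta family, giving Beta(a_pp, b_pp).
a_pp = a0 + alpha * k0
b_pp = b0 + alpha * (n0 - k0)

# Posterior after also observing the current data.
a_post = a_pp + k
b_post = b_pp + (n - k)

post_mean = a_post / (a_post + b_post)
print(round(post_mean, 3))   # -> 0.263
```

Note how the historical study contributes only `alpha * n0` effective observations, which is exactly the discounting role of the power parameter.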

Adjusting for publication bias is essential when drawing meta-analytic inferences. However, most methods that adjust for publication bias are sensitive to the particular research conditions, such as the degree of heterogeneity in effect sizes across studies. Sladekova et al. (2022) tried to circumvent this complication by selecting the methods that ar...

Meta-analysis is an important quantitative tool for cumulative science, but its application is frustrated by publication bias. In order to test and adjust for publication bias, we extend model-averaged Bayesian meta-analysis with selection models. The resulting robust Bayesian meta-analysis (RoBMA) methodology does not require all-or-none decisions...

Ly & Wagenmakers (in press) critiqued the Full Bayesian Significance Test (FBST) and the associated statistic FBST ev: similar to the frequentist p-value, FBST ev cannot quantify evidence for the null hypothesis, allows sampling to a foregone conclusion, and suffers from the Jeffreys-Lindley paradox. In response, Kelter (in press) suggested that th...

The current practice of reliability analysis is both uniform and troublesome: most reports consider only Cronbach’s α, and almost all reports focus exclusively on a point estimate, disregarding the impact of sampling error. In an attempt to improve the status quo we have implemented Bayesian estimation routines for five popular single-test reliabil...

In van Doorn et al. (2021) we outlined a series of open questions concerning Bayes factors for mixed effects model comparison, with an emphasis on the impact of aggregation, the effect of measurement error, the choice of prior distributions, and the detection of interactions. Seven expert commentaries (partially) addressed these initial questions....

Many studies report atypical responses to sensory information in autistic individuals, yet it is not clear which stages of processing are affected, with little consideration given to decision-making processes. We combined diffusion modelling with high-density EEG to identify which processing stages differ between 50 autistic and 50 typically develo...

In the main article on the Many-Analysts Religion Project (MARP) the results of the 120 analysis teams were summarized by taking each team's reported effect size and subjective assessment of the relation between religiosity and well-being, and the moderating role of cultural norms on this relation. Here, we discuss the findings in the main manuscri...

A frequentist confidence interval can be constructed by inverting a hypothesis test, such that the interval contains only parameter values that would not have been rejected by the test. We show how a similar definition can be employed to construct a Bayesian support interval. Consistent with Carnap’s theory of corroboration, the support interval co...
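
In symbols (notation assumed for this sketch), the $k$-support interval collects the parameter values whose plausibility the data have increased by at least a factor $k$:

```latex
\mathrm{SI}_k \;=\; \left\{\theta : \frac{p(\theta \mid \text{data})}{p(\theta)} \ge k\right\}
\;=\; \left\{\theta : \frac{p(\text{data} \mid \theta)}{p(\text{data})} \ge k\right\},
```

where the second form follows from Bayes' rule: the updating factor for $\theta$ equals the ratio of its likelihood to the marginal likelihood.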

Bayesian inference requires the specification of prior distributions that quantify the pre-data uncertainty about parameter values. One way to specify prior distributions is through prior elicitation, an interview method guiding field experts through the process of expressing their knowledge in the form of a probability distribution. However, prior...

The last 25 years have shown a steady increase in attention for the Bayes factor as a tool for hypothesis evaluation and model selection. The present review highlights the potential of the Bayes factor in psychological research. We discuss six types of applications: Bayesian evaluation of point null, interval, and informative hypotheses, Bayesian e...

Auditors who perform audit sampling are often interested in obtaining evidence for or against the hypothesis that the misstatement in a population of items is lower than a critical limit, the so-called performance materiality. Here, we propose to perform this hypothesis test using a Bayesian approach that involves the use of an impartial prior dist...

A clear separation between exploratory and confirmatory analyses is vital for ensuring the credibility of research. In this preprint, we present a two-stage Bayesian sequential procedure that combines a maximum of exploratory freedom in the first stage with a strictly confirmatory regimen in the second stage, while allowing for flexible sampling schemes and...

Construal Level Theory (CLT) is one of the most foundational theories in social cognition. However, the few replication studies available indicate a mixed pattern regarding the evidence supporting this theory. This article assesses the credibility of CLT more widely by using publication bias correction techniques on published studies in the CLT lit...

We outline an approximation to informed Bayes factors for a focal parameter $\theta$ that requires only the maximum likelihood estimate $\hat\theta$ and its standard error. The approximation uses an estimated likelihood of $\theta$ and assumes that the posterior distribution for $\theta$ is unaffected by the choice of prior distribution for the nui...
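
A minimal sketch of this style of approximation, assuming a normal estimated likelihood $\hat\theta \sim N(\theta, \mathrm{se}^2)$, a point null $\theta = 0$, and a normal prior under the alternative; all numbers are hypothetical and the paper's exact recipe may differ:

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical inputs: MLE and its standard error, plus a normal prior
# for theta under H1 (mean mu1, standard deviation tau1).
theta_hat, se = 0.30, 0.10
mu1, tau1 = 0.0, 0.5

# Estimated likelihood: theta_hat ~ N(theta, se^2). The marginal
# likelihood under H1 is then N(theta_hat; mu1, se^2 + tau1^2).
m0 = normal_pdf(theta_hat, 0.0, se ** 2)             # point null H0: theta = 0
m1 = normal_pdf(theta_hat, mu1, se ** 2 + tau1 ** 2)  # marginal under H1
bf10 = m1 / m0
print(bf10)   # Bayes factor in favor of H1 over the point null
```

Only the point estimate and standard error enter the calculation, which is what makes approximations of this kind convenient when raw data are unavailable.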

In a sequential hypothesis test, the analyst checks at multiple steps during data collection whether sufficient evidence has accrued to make a decision about the tested hypotheses. As soon as sufficient information has been obtained, data collection is terminated. Here, we compare two sequential hypothesis testing procedures that have recently been...

Current developments in the statistics community suggest that modern statistics education should be structured holistically, that is, by allowing students to work with real data and to answer concrete statistical questions, but also by educating them about alternative frameworks, such as Bayesian inference. In this article, we describe how we incor...

Cash transfers are among the most popular poverty interventions. Indeed, the charity evaluator GiveWell even lists GiveDirectly - a charity that directly sends your donations as cash to people in extreme poverty - as one of their top-rated charities [https://www.givewell.org/charities/give-directly]. McGuire, Kaiser, and Bach-Mortensen (1) conducted a...

The D-HEALTH trial concluded that “Administering vitamin D₃ monthly to unscreened older people did not reduce all-cause mortality”. Here we present the results of a Bayesian reanalysis and show that the data from the D-HEALTH trial, considered in isolation, strongly increase the plausibility that a vitamin D regimen is ineffective in lowering all-...

In psychology, preregistration is the most widely used method to ensure the confirmatory status of analyses. However, the method has disadvantages: not only is it perceived as effortful and time consuming, but reasonable deviations from the analysis plan demote the status of the study to exploratory. An alternative to preregistration is analysis bl...

The preregistration of research protocols and analysis plans is a main reform innovation to counteract confirmation bias in the social and behavioral sciences. While theoretical reasons to preregister are frequently discussed in the literature, the individually experienced advantages and disadvantages of this method remain largely unexplored. The g...

Hypotheses concerning the distribution of multinomial proportions typically entail exact equality constraints that can be evaluated using standard tests. Whenever researchers formulate inequality constrained hypotheses, however, they must rely on sampling-based methods that are relatively inefficient and computationally expensive. To address this p...
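
The sampling-based approach that bridge sampling improves upon can be sketched with the encompassing-prior idea: for a purely inequality-constrained hypothesis, the Bayes factor against the unconstrained model equals the ratio of posterior to prior mass consistent with the constraint. Counts, prior, and constraint below are hypothetical:

```python
import random

random.seed(1)

# Hypothetical data: counts in three categories; Dirichlet(1, 1, 1) prior.
counts = [50, 30, 20]
prior = [1.0, 1.0, 1.0]

def rdirichlet(alphas):
    # Dirichlet draw via normalized gamma variates.
    draws = [random.gammavariate(a, 1.0) for a in alphas]
    s = sum(draws)
    return [d / s for d in draws]

def ordered(theta):
    # Inequality constraint H_r: theta1 > theta2 > theta3.
    return theta[0] > theta[1] > theta[2]

n_draws = 20000
prior_hits = sum(ordered(rdirichlet(prior)) for _ in range(n_draws))
post_alphas = [a + c for a, c in zip(prior, counts)]
post_hits = sum(ordered(rdirichlet(post_alphas)) for _ in range(n_draws))

# Bayes factor of the constrained hypothesis vs the encompassing model:
# ratio of the posterior and prior proportions satisfying the constraint.
bf_re = (post_hits / n_draws) / (prior_hits / n_draws)
print(bf_re)
```

The inefficiency is visible in the denominator: with many categories or tight constraints, hardly any prior draws satisfy the constraint, so the estimator becomes unstable; bridge sampling sidesteps this.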

Testing the equality of two proportions is a common procedure in science, especially in medicine and public health. In these domains, it is crucial to be able to quantify evidence for the absence of a treatment effect. Bayesian hypothesis testing by means of the Bayes factor provides one avenue to do so, requiring the specification of prior distrib...

Tendeiro and Kiers (2019) provide a detailed and scholarly critique of Null Hypothesis Bayesian Testing (NHBT) and its central component-the Bayes factor-that allows researchers to update knowledge and quantify statistical evidence. Tendeiro and Kiers conclude that NHBT constitutes an improvement over frequentist p-values, but primarily elaborate o...

The Jeffreys-Lindley paradox exposes a rift between Bayesian and frequentist hypothesis testing that strikes at the heart of statistical inference. Contrary to what most current literature suggests, the paradox was central to the Bayesian testing methodology developed by Sir Harold Jeffreys in the late 1930s. Jeffreys showed that the evidence again...

Children with and without dyslexia differ in their behavioural responses to visual information, particularly when required to pool dynamic signals over space and time. Importantly, multiple processes contribute to behavioural responses. Here we investigated which processing stages are affected in children with dyslexia when performing visual motion...

We argue that statistical practice in the social and behavioural sciences benefits from transparency, a fair acknowledgement of uncertainty and openness to alternative interpretations. Here, to promote such a practice, we recommend seven concrete statistical procedures: (1) visualizing data; (2) quantifying inferential uncertainty; (3) assessing da...

Any large dataset can be analyzed in a number of ways, and it is possible that the use of different analysis strategies will lead to different results and conclusions. One way to assess whether the results obtained depend on the analysis strategy chosen is to employ multiple analysts and leave each of them free to follow their own approach. Here, w...

On October 1st 2021, Merck issued a press release claiming that "molnupiravir (MK-4482, EIDD-2801), an investigational oral antiviral medicine, significantly reduced the risk of hospitalization or death at a planned interim analysis of the Phase 3 MOVe-OUT trial in at risk, non-hospitalized adult patients with mild-to-moderate COVID-19." Specifical...

Some important research questions require the ability to find evidence for two conditions being practically equivalent. This is impossible to accomplish within the traditional frequentist null hypothesis significance testing framework; hence, other methodologies must be utilized. We explain and illustrate three approaches for finding evidence for e...

Researchers conduct a meta-analysis in order to synthesize information across different studies. Compared to standard meta-analytic methods, Bayesian model-averaged meta-analysis offers several practical advantages including the ability to quantify evidence in favor of the absence of an effect, the ability to monitor evidence as individual studies...

We outline a Bayesian model‐averaged (BMA) meta‐analysis for standardized mean differences in order to quantify evidence for both treatment effectiveness δ and across‐study heterogeneity τ. We construct four competing models by orthogonally combining two present‐absent assumptions, one for the treatment effect and one for across‐study heterogeneity...

We outline a Bayesian model-averaged meta-analysis for standardized mean differences in order to quantify evidence for both treatment effectiveness $\delta$ and across-study heterogeneity $\tau$. We construct four competing models by orthogonally combining two present-absent assumptions, one for the treatment effect and one for across-study heterog...

The replicability of findings in experimental psychology can be improved by distinguishing sharply between hypothesis-generating research and hypothesis-testing research. This distinction can be achieved by preregistration, a method that has recently attracted widespread attention. Although preregistration is fair in the sense that it inoculates re...

Roberts (2020, Learning & Behavior, 48 [2], 191–192) discussed research claiming honeybees can do arithmetic. Some readers of this research might regard such claims as unlikely. The present authors used this example as a basis for a debate on the criterion that ought to be used for publication of results or conclusions that could be viewed as unlik...

Testing the equality of two proportions is a common procedure in science, especially in medicine and public health. In these domains it is crucial to be able to quantify evidence for the absence of a treatment effect. Bayesian hypothesis testing by means of the Bayes factor provides one avenue to do so, requiring the specification of prior distribu...

The impact of statistical methods on the audit practice is growing because of the increasing availability of audit data and the statistical methods to analyze these data. A key aspect in the statistical approach to auditing is assessing the strength of evidence for or against a hypothesis. Unfortunately, the often-used frequentist statistical metho...

Cognitive models provide a substantively meaningful quantitative description of latent cognitive processes. The quantitative formulation of these models supports cumulative theory building and enables strong empirical tests. However, the non-linearity of these models and pervasive correlations among model parameters pose special challenges when app...

Although Bayesian linear mixed effects models are increasingly popular for analysis of within-subject designs in psychology and other fields, there remains considerable ambiguity on the most appropriate Bayes factor hypothesis test to quantify the degree to which the data support the presence or absence of an experimental effect. Specifically, diff...

Meta-analysis is the predominant approach for quantitatively synthesizing a set of studies. If the studies themselves are of high quality, meta-analysis can provide valuable insights into the current scientific state of knowledge about a particular phenomenon. In psychological science, the most common approach is to conduct frequentist meta-analysi...

The “Full Bayesian Significance Test e-value”, henceforth FBST ev, has received increasing attention across a range of disciplines including psychology. We show that the FBST ev leads to four problems: (1) the FBST ev cannot quantify evidence in favor of a null hypothesis and therefore also cannot discriminate “evidence of absence” from “absen...

Popular in business, psychology, and the analysis of clinical trial data, the A/B test refers to a comparison between two proportions. Here we discuss two Bayesian A/B tests that allow users to monitor the uncertainty about a difference in two proportions as data accumulate over time. We emphasize the advantage of assigning a dependent prior distri...
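
A minimal Monte Carlo sketch of monitoring such an A/B comparison, here with independent Beta(1, 1) priors for brevity (the paper argues for a dependent prior on the two proportions); all counts are hypothetical:

```python
import random

random.seed(0)

# Hypothetical A/B data: successes / trials in each condition.
s_a, n_a = 120, 1000
s_b, n_b = 150, 1000

draws = 20000
wins = 0
for _ in range(draws):
    # Conjugate beta posteriors for the two proportions.
    p_a = random.betavariate(1 + s_a, 1 + n_a - s_a)
    p_b = random.betavariate(1 + s_b, 1 + n_b - s_b)
    wins += p_b > p_a

print(wins / draws)   # posterior probability that B's rate exceeds A's
```

Because the posterior is updated rather than a long-run error rate invoked, this probability can be recomputed after every new observation without a stopping-rule penalty.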

Auditors often have prior information about the auditee before starting the substantive testing phase. We show that applying Bayesian statistics in substantive testing allows for integration of this information into the statistical analysis through the prior distribution. For example, an auditor might have performed an audit last year, they might h...
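
A minimal conjugate sketch of folding last year's audit into this year's substantive test via the prior distribution; sample sizes, error counts, and the uniform starting prior are all hypothetical placeholders:

```python
import random

random.seed(2)

# Hypothetical figures. Last year's audit: n0 items, k0 misstated;
# this year's sample: n items, k misstated; performance materiality 5%.
k0, n0 = 1, 100        # last year's evidence, used to form the prior
k, n = 0, 60           # this year's sample
materiality = 0.05

# A Beta(1, 1) starting prior updated with last year's data gives this
# year's prior; conjugacy then simply adds this year's counts.
a_post = 1 + k0 + k
b_post = 1 + (n0 - k0) + (n - k)

draws = 50000
below = sum(
    random.betavariate(a_post, b_post) < materiality for _ in range(draws)
)
print(below / draws)  # posterior probability misstatement rate < materiality
```

The same posterior probability could be obtained in closed form from the beta CDF; the sampling version is shown because it extends directly to less tidy priors.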

The multibridge R package allows a Bayesian evaluation of informed hypotheses $H_r$ applied to frequency data from an independent binomial or multinomial distribution. multibridge uses bridge sampling to efficiently compute Bayes factors for the following hypotheses concerning the latent category proportions $\theta$: (a) hypotheses that postulate equal...

We present consensus-based guidance for conducting and documenting multi-analyst studies. We discuss why broader adoption of the multi-analyst approach will strengthen the robustness of results and conclusions in empirical sciences.

Linear regression analyses commonly involve two consecutive stages of statistical inquiry. In the first stage, a single ‘best’ model is defined by a specific selection of relevant predictors; in the second stage, the regression coefficients of the winning model are used for prediction and for inference concerning the importance of the predictors. H...

Popular measures of reliability for a single-test administration include coefficient alpha, coefficient lambda2, the greatest lower bound (glb), and coefficient omega. First, we show how these measures can be easily estimated within a Bayesian framework. Specifically, the posterior distribution for these measures can be obtained through Gibbs sampl...
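
For reference, the classical point estimate of coefficient alpha that these Bayesian routines place a posterior around can be computed directly from an item covariance matrix; the 3-item matrix below is hypothetical:

```python
# Hypothetical 3-item covariance matrix.
cov = [
    [1.00, 0.45, 0.50],
    [0.45, 1.00, 0.40],
    [0.50, 0.40, 1.00],
]

k = len(cov)
total_var = sum(sum(row) for row in cov)        # variance of the sum score
item_var = sum(cov[i][i] for i in range(k))     # sum of the item variances
alpha = k / (k - 1) * (1 - item_var / total_var)
print(round(alpha, 3))   # -> 0.711
```

In the Gibbs-sampling approach described above, this same formula is evaluated at each posterior draw of the covariance matrix, yielding a posterior distribution (and hence an uncertainty interval) for alpha.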

We explore the promise of statistical reform by starting from the assumption that most researchers would endorse Merton's ethos of science as reflected in the four norms of communalism, universalism, disinterestedness, and organized skepticism. Translated to data analysis, these norms imply a need for transparency, a fair acknowledgement of uncerta...

Although Bayesian mixed models are increasingly popular for data analysis in psychology and other fields, there remains considerable ambiguity on the most appropriate Bayes factor hypothesis test to quantify the degree to which the data support the presence or absence of an experimental effect. Specifically, different choices for both the null mode...

Gautret and colleagues reported the results of a non-randomised case series which examined the effects of hydroxychloroquine and azithromycin on viral load in the upper respiratory tract of Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) patients. The authors reported that hydroxychloroquine (HCQ) had significant virus reducing effects...

A perennial objection against Bayes factor point-null hypothesis tests is that the point-null hypothesis is known to be false from the outset. Following Morey and Rouder (2011) we examine the consequences of approximating the sharp point-null hypothesis by a hazy `peri-null' hypothesis instantiated as a narrow prior distribution centered on the poi...

Politicians who lie are more likely to be reelected. That is what Janezic and Gallego (1) concluded. They asked 816 Spanish mayors to toss a coin, with only heads resulting in a desired personalized report of the study results. Mayors reported heads more often (68%) than expected by chance (50%), and reporting heads significantly predicted reelecti...

We conducted a preregistered, multi-laboratory project (k = 36; N = 3531) to assess the size and robustness of ego depletion effects using a novel replication method, termed the paradigmatic replication approach. Laboratories implemented one of two procedures that intended to manipulate self-control and tested performance on a subsequent measure of...