Eric-Jan Wagenmakers
University of Amsterdam | UVA · Department of Psychological Methods
About
518 Publications · 225,688 Reads
42,665 Citations
Publications (518)
Proper data visualization helps researchers draw correct conclusions from their data and facilitates a more complete and transparent report of the results. In factorial designs, so-called raincloud plots have recently attracted attention as a particularly informative data visualization technique; raincloud plots can simultaneously show summary stat...
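Since the raincloud layout may be unfamiliar, here is a minimal sketch of the general idea in Python with matplotlib, using synthetic data; the variable names and layout parameters are illustrative and not taken from the paper.

```python
# Minimal raincloud-style plot: half-violin ("cloud"), boxplot, and
# jittered raw data points ("rain") for one group. Synthetic data for
# illustration only.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
data = rng.normal(loc=100, scale=15, size=80)  # hypothetical scores

fig, ax = plt.subplots(figsize=(6, 3))

# Cloud: kernel density estimate drawn as a half-violin above the axis.
violin = ax.violinplot(data, positions=[0], vert=False, showextrema=False)
for body in violin['bodies']:
    verts = body.get_paths()[0].vertices
    verts[:, 1] = np.clip(verts[:, 1], 0, None)  # keep the upper half only

# Box: summary statistics (median, quartiles).
ax.boxplot(data, positions=[0], vert=False, widths=0.1, showfliers=False)

# Rain: raw observations, jittered below the box.
jitter = rng.uniform(-0.3, -0.15, size=data.size)
ax.scatter(data, jitter, s=8, alpha=0.5)

ax.set_yticks([])
ax.set_xlabel("score")
plt.show()
```

The half-violin shows the estimated distribution, the box the summary statistics, and the jittered points the raw data, which is what lets a single panel carry all three levels of information at once.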
Here we test the feasibility of using decision markets to select studies for replication and provide evidence about the replicability of online experiments. Social scientists (n = 162) traded on the outcome of close replications of 41 systematically selected MTurk social science experiments published in PNAS 2015–2018, knowing that the 12 studies w...
One of the most common statistical analyses in experimental psychology concerns the comparison of two means using the frequentist t test. However, frequentist t tests do not quantify evidence and require various assumption tests. Recently popularized Bayesian t tests do quantify evidence, but these were developed for scenarios where the two popula...
Paranormal beliefs encompass a wide variety of phenomena, including the existence of supernatural entities such as ghosts and witches, as well as extraordinary human abilities such as telepathy and clairvoyance. In the current study, we used a nationally representative sample (N = 2534) to investigate the presence and correlates of paranormal bel...
Bayes factor hypothesis testing provides a powerful framework for assessing the evidence in favor of competing hypotheses. To obtain Bayes factors, statisticians often require advanced, non-standard tools, making it important to confirm that the methodology is computationally sound. This paper seeks to validate Bayes factor calculations by applying...
The purpose of outlier exclusion is to improve data quality and prevent model misspecification. However, procedures to identify and exclude outliers may bring unwanted side effects such as an increase in the Type I error rate. Here we study the side effects of outlier exclusion procedures on the Bayes factor hypothesis test. We focus on the Bayesian in...
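As a rough illustration of the side effect at issue, the toy simulation below applies a z-score exclusion rule and then tests the cleaned data. It uses a frequentist one-sample t test as a simple stand-in for the Bayes factor test studied in the paper, and all settings are illustrative.

```python
# Toy simulation: excluding "outliers" and then testing on the same
# data tends to inflate the Type I error rate. A frequentist t test is
# used here as a stand-in for the Bayes factor test in the paper.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n_sims, n, alpha = 5_000, 50, 0.05
false_positives = 0

for _ in range(n_sims):
    x = rng.normal(size=n)               # H0 is true: population mean is zero
    z = (x - x.mean()) / x.std(ddof=1)
    x_clean = x[np.abs(z) < 2]           # exclude |z| >= 2 "outliers"
    if stats.ttest_1samp(x_clean, 0).pvalue < alpha:
        false_positives += 1

print("empirical Type I error rate:", false_positives / n_sims)
```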
Stratification is a statistical technique commonly used in audit sampling to increase efficiency. The reason for this increase is that stratification enhances the representativeness of the sample data and increases the accuracy of the misstatement estimate, which leads to a reduction in overall sample size. However, currently dominant methods for e...
Many-analysts studies explore how well an empirical claim withstands plausible alternative analyses of the same dataset by multiple, independent analysis teams. Conclusions from these studies typically rely on a single outcome metric (e.g. effect size) provided by each analysis team. Although informative about the range of plausible effects in a da...
Bayesian nested-model comparisons involve an assessment of the probabilities for a relatively simple model and a more general encompassing model. Since the simpler model can be viewed as a subset of the more complex model it is nested in, Popper has argued that the axioms of probability are violated when the simpler model is nonetheless assigned a...
Linde et al. (2021) compared three statistical procedures for establishing that two groups are equivalent on some kind of measure in terms of how often they conclude equivalence even though the groups are truly nonequivalent and how often they conclude equivalence when the groups are truly equivalent. They found that the “interval Bayes factor” pro...
Statistical methods play an important role in auditors’ analyses of their clients’ data. A key component of the statistical approach to auditing is assessing the strength of evidence for or against a hypothesis. We argue that the frequentist statistical methods often used by auditors cannot provide the statistical evidence that audit standards advo...
Considering their influential role in synthesizing the existing evidence on a particular topic, it is especially important that meta-analyses are conducted and reported to the highest standards, and the risk of bias is minimized. Preregistration can help detect and reduce bias arising from opportunistic use of ‘researcher degrees of freedom’. Howev...
Network psychometrics uses graphical models to assess the network structure of psychological variables. An important task in their analysis is determining which variables are unrelated in the network, i.e., are independent given the rest of the network variables. This conditional independence structure is a gateway to understanding the causal struc...
The vast majority of empirical research articles feature a single primary analysis outcome that is the result of a single analysis plan, executed by a single analysis team. However, recent multi-analyst projects have demonstrated that different analysis teams usually adopt a unique approach and that there exists considerable variability in the asso...
Parsimony has long served as a criterion for selecting between scientific theories, hypotheses, and models. Yet recent years have seen an explosion of incredibly complex models, such as deep neural networks (e.g., for 3D protein folding) and multi-model ensembles (e.g., for climate forecasting). This perspective aims to re-examine the principle of...
In their book ‘Nudge: Improving Decisions About Health, Wealth and Happiness’, Thaler & Sunstein (2009) argue that choice architectures are promising public policy interventions. This research programme motivated the creation of ‘nudge units’, government agencies which aim to apply insights from behavioural science to improve public policy. We clos...
A fundamental part of experimental design is to determine the sample size of a study. However, sparse information about population parameters and effect sizes before data collection renders effective sample size planning challenging. Specifically, sparse information may lead research designs to be based on inaccurate a priori assumptions, causing s...
Publication selection bias undermines the systematic accumulation of evidence. To assess the extent of this problem, we survey over 68,000 meta‐analyses containing over 700,000 effect size estimates from medicine (67,386/597,699), environmental sciences (199/12,707), psychology (605/23,563), and economics (327/91,421). Our results indicate that met...
Forensic psychiatric hospitals regularly monitor the mental health and forensic risk factors of their patients. As part of this monitoring, staff score patients on various items. Common practice is to aggregate these scores across staff members. However, this is suboptimal because it assumes that assessors are interchangeable and that patients are...
Medical professionals, patients, students, and the public at large regularly need to interpret the outcome of medical tests. These tests are error-prone, however, and the fact that the outcome is positive (or negative) does not establish with certainty that the disease is present (or absent). The correct interpretation of the test outcome demands t...
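The required computation is Bayes' rule. A minimal sketch with illustrative numbers, not taken from the paper:

```python
# Posterior probability of disease given a positive test, via Bayes' rule.
# Prevalence, sensitivity, and specificity below are illustrative only.
prevalence = 0.01          # P(disease)
sensitivity = 0.90         # P(positive | disease)
specificity = 0.95         # P(negative | no disease)

p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_disease_given_pos = sensitivity * prevalence / p_pos

print(f"P(disease | positive) = {p_disease_given_pos:.3f}")  # ~0.154
```

Even with a fairly accurate test, the low prevalence keeps the posterior probability of disease well below one half, which is exactly the kind of base-rate reasoning the abstract says the correct interpretation demands.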
Network psychometrics is a new direction in psychological research that conceptualizes psychological constructs as systems of interacting variables. In network analysis, variables are represented as nodes, and their interactions yield (partial) associations. Current estimation methods mostly use a frequentist approach, which does not allow for prop...
Meta-regression constitutes an essential meta-analytic tool for investigating sources of heterogeneity and assessing the impact of moderators. However, existing methods for meta-regression have limitations that include inadequate consideration of model uncertainty and poor performance under publication bias. To overcome these limitations, we extend...
The ongoing replication crisis in science has increased interest in the methodology of replication studies. We propose a novel Bayesian analysis approach using power priors: The likelihood of the original study’s data is raised to the power of $\alpha$, and then used as the prior distribution in the analysis of the replication data. Posterior...
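In symbols, the power-prior construction described here amounts to the following (a sketch, with $\pi_0$ an initial prior and $y_{\text{orig}}$, $y_{\text{rep}}$ the original and replication data):

```latex
\[
  p(\theta \mid y_{\text{rep}}, y_{\text{orig}}, \alpha)
  \;\propto\;
  L(\theta \mid y_{\text{rep}})\,
  L(\theta \mid y_{\text{orig}})^{\alpha}\,
  \pi_0(\theta),
  \qquad \alpha \in [0, 1].
\]
```

With $\alpha = 0$ the original data are ignored; with $\alpha = 1$ they are pooled with the replication data at full weight.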
The Bayes factor, a Bayesian method for hypothesis testing, is seeing increasing use in psychological research. Bayes factors quantify the evidence in favor of each hypothesis or model, so that their magnitude indicates which hypothesis or model the data support more strongly. However, Chinese-language introductions to the principles and application of Bayes factors for analysis of variance are still lacking. This paper first introduces the basic logic of Bayesian ANOVA and the principles behind its computation, and then uses example data to demonstrate how to conduct a Bayesian ANOVA in JASP for five experimental designs commonly used in psychology (one-way between-subjects, one-way within-subjects, two-way between-subjects, two-way within-subjects, and two-way mixed designs), and how to interpret and report the results. Bayesian ANOVA offers an effective alternative to traditional ANOVA and is a powerful tool for statistical inference.
Huisman (Psychonomic Bulletin & Review, 1–10, 2022) argued that a valid measure of evidence should indicate more support in favor of a true alternative hypothesis when sample size is large than when it is small. Bayes factors may violate this pattern and hence Huisman concluded that Bayes factors are invalid as a measure of evidence. In this brie...
A staple of Bayesian model comparison and hypothesis testing Bayes factors are often used to quantify the relative predictive performance of two rival hypotheses. The computation of Bayes factors can be challenging, however, and this has contributed to the popularity of convenient approximations such as the Bayesian information criterion (BIC). Unf...
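For reference, the BIC approximation in question is commonly written as follows (the standard form, e.g. as popularized by Wagenmakers, 2007, not a formula quoted from this abstract):

```latex
\[
  \mathrm{BF}_{01} \;\approx\;
  \exp\!\left(\frac{\mathrm{BIC}(H_1) - \mathrm{BIC}(H_0)}{2}\right),
  \qquad
  \mathrm{BIC}(H_i) = -2 \ln \hat{L}_i + k_i \ln n,
\]
```

where $\hat{L}_i$ is the maximized likelihood, $k_i$ the number of free parameters, and $n$ the sample size.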
After Bayes, the oldest Bayesian account of enumerative induction is given by Laplace's so-called rule of succession: if all $n$ observed instances of a phenomenon to date exhibit a given character, the probability that the next instance of that phenomenon will also exhibit the character is $\frac{n+1}{n+2}$. Laplace's rule, however, has the apparent...
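A one-line derivation of the rule, assuming the standard uniform prior on the success probability:

```latex
% With a uniform Beta(1, 1) prior on theta, observing n successes in
% n trials yields a Beta(n + 1, 1) posterior, whose predictive mean is
\[
  P(\text{next instance exhibits the character} \mid n \text{ of } n)
  \;=\; \frac{n+1}{n+2}.
\]
```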
Adjusting for publication bias is essential when drawing meta-analytic inferences. However, most methods that adjust for publication bias do not perform well across a range of research conditions, such as the degree of heterogeneity in effect sizes across studies. Sladekova et al. (2022; Estimating the change in meta-analytic effect size estimates a...
Power priors are used for incorporating historical data in Bayesian analyses by taking the likelihood of the historical data raised to the power $\alpha$ as the prior distribution for the model parameters. The power parameter $\alpha$ is typically unknown and assigned a prior distribution, most commonly a beta distribution. Here, we give a novel theoretical result o...
The delayed and incomplete availability of historical findings and the lack of integrative and user-friendly software hamper the reliable interpretation of new clinical data. We developed a free, open, and user-friendly clinical trial aggregation program combining a large and representative sample of existing trial data with the latest classical a...
Bayesian model-averaged meta-analysis allows quantification of evidence for both treatment effectiveness $\mu$ and across-study heterogeneity $\tau$. We use the Cochrane Database of Systematic Reviews to develop discipline-wide empirical prior distributions for $\mu$ and $\tau$ for meta-analyses of binary and time-to-event clinical trial outcomes....
The equation for the Pearson correlation coefficient can be represented in a scatter plot as the difference in area between concordant and discordant rectangles, scaled by an area that represents the maximum possible concordance. Rarely employed in statistics textbooks, this simple visualization may facilitate a deeper understanding of the nature of...
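A small numerical check of this rectangle-area reading of the Pearson correlation, in Python with synthetic data (variable names illustrative):

```python
# Pairwise-rectangle view of the Pearson correlation: for every pair of
# points, the signed area (x_i - x_j)(y_i - y_j) is positive for
# concordant rectangles and negative for discordant ones.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 0.6 * x + rng.normal(scale=0.8, size=50)

# Sum of signed rectangle areas over all unordered pairs equals
# n * sum((x - mean(x)) * (y - mean(y))).
areas = (x[:, None] - x[None, :]) * (y[:, None] - y[None, :])
numerator = areas.sum() / 2  # each unordered pair counted once

# Scale by the analogous maximum-concordance areas for x and y alone.
scale = np.sqrt(((x[:, None] - x[None, :]) ** 2).sum() / 2
                * ((y[:, None] - y[None, :]) ** 2).sum() / 2)

r_rectangles = numerator / scale
print(np.isclose(r_rectangles, np.corrcoef(x, y)[0, 1]))  # True
```

Each unordered pair of points contributes the signed area of the rectangle it spans; concordant pairs add, discordant pairs subtract, and the scaling term plays the role of the maximum possible concordance.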
The multibridge package allows a Bayesian evaluation of informed hypotheses $\mathscr{H}_r$ applied to frequency data from an independent binomial or multinomial distribution. multibridge uses bridge sampling to efficiently compute Bayes factors for the following hypotheses concerning the latent category proportions $\theta$: (a) hypotheses that...
Cognitive models use mathematical equations to describe how observable human behavior, such as the speed and accuracy of a person’s answers on a knowledge test, relate to the underlying unobservable cognitive processes, such as retrieving information from memory. The numerical values of the parameters of these equations quantify different aspects o...
In their book `Nudge: Improving Decisions About Health, Wealth and Happiness', Thaler and Sunstein argue that choice architectures are promising public policy interventions. This research programme motivated the creation of so-called `nudge units' which aim to apply insights from behavioural science to improve public policy. We take a close look at...
Researchers conduct meta-analyses in order to synthesize information across different studies. Compared to standard meta-analytic methods, Bayesian model-averaged meta-analysis offers several practical advantages including the ability to quantify evidence in favor of the absence of an effect, the ability to monitor evidence as individual studies ac...
Empirical claims are inevitably associated with uncertainty, and a major goal of data analysis is therefore to quantify that uncertainty. Recent work has revealed that most uncertainty may lie not in what is usually reported (e.g., p value, confidence interval, or Bayes factor) but in what is left unreported (e.g., how the experiment was designed,...
Running developmental experiments, particularly with infants, is often time‐consuming and intensive, and the recruitment of participants is hard and expensive. Thus, an important goal for developmental researchers is to optimize sampling plans such that neither too many nor too few participants are tested given the hypothesis of interest. One appro...
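One common approach of this kind is sequential Bayes factor monitoring with a stopping threshold. The sketch below uses a deliberately simple Bayes factor for two point hypotheses under a normal likelihood, not the default t-test Bayes factor typically used in practice; all settings are illustrative.

```python
# Sequential design sketch: monitor a Bayes factor as participants
# accrue and stop once it crosses a threshold in either direction.
# The Bayes factor here compares two point hypotheses (effect 0 vs 0.5)
# with a unit-variance normal likelihood, purely for illustration.
import numpy as np

rng = np.random.default_rng(6)
threshold, max_n = 10.0, 100
data = []

for _ in range(max_n):
    data.append(rng.normal(loc=0.5))      # hypothetical: effect present
    x = np.asarray(data)
    # Log likelihood ratio of H1 (mean 0.5) over H0 (mean 0).
    log_bf = np.sum(-0.5 * (x - 0.5) ** 2 + 0.5 * x ** 2)
    bf = np.exp(log_bf)
    if bf > threshold or bf < 1 / threshold:
        break

print(f"stopped at n = {len(data)}, BF10 = {bf:.2f}")
```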
van Doorn et al. (2021) outlined various questions that arise when conducting Bayesian model comparison for mixed effects models. Seven response articles offered their own perspective on the preferred setup for mixed model comparison, on the most appropriate specification of prior distributions, and on the desirability of default recommendations. T...
In van Doorn et al. (2021), we outlined a series of open questions concerning Bayes factors for mixed effects model comparison, with an emphasis on the impact of aggregation, the effect of measurement error, the choice of prior distributions, and the detection of interactions. Seven expert commentaries (partially) addressed these initial questions....
Flexibility in the design, analysis and interpretation of scientific studies creates a multiplicity of possible research outcomes. Scientists are granted considerable latitude to selectively use and report the hypotheses, variables and analyses that create the most positive, coherent and attractive story while suppressing those that are negative or...
Network psychometrics is a new direction in psychological research that conceptualizes multivariate data as interacting systems. Variables are represented as nodes and their interactions yield (partial) associations. Current estimation methods mostly use a frequentist approach, which does not allow for proper uncertainty quantification of the model...
In psychology, preregistration is the most widely used method to ensure the confirmatory status of analyses. However, the method has disadvantages: Not only is it perceived as effortful and time-consuming, but reasonable deviations from the analysis plan demote the status of the study to exploratory. An alternative to preregistration is analysis bl...
The need for a comparison between two proportions (sometimes called an A/B test) often arises in business, psychology, and the analysis of clinical trial data. Here we discuss two Bayesian A/B tests that allow users to monitor the uncertainty about a difference in two proportions as data accumulate over time. We emphasize the advantage of assigning...
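A minimal conjugate sketch of such a monitoring analysis in Python, with hypothetical counts; the uniform Beta(1, 1) priors below are the simplest choice and would be replaced by informed priors of the kind the abstract alludes to.

```python
# Conjugate Bayesian A/B test sketch: beta posteriors for two
# proportions, with a Monte Carlo estimate of P(theta_B > theta_A).
# Counts are hypothetical.
import numpy as np

rng = np.random.default_rng(3)

# Observed data: successes and trials per condition (illustrative).
successes_a, n_a = 48, 100
successes_b, n_b = 60, 100

# Beta(1, 1) priors update to Beta(1 + s, 1 + n - s) posteriors.
post_a = rng.beta(1 + successes_a, 1 + n_a - successes_a, size=100_000)
post_b = rng.beta(1 + successes_b, 1 + n_b - successes_b, size=100_000)

print(f"P(theta_B > theta_A | data) = {np.mean(post_b > post_a):.3f}")
```

Because the posteriors are available in closed form, this comparison can be recomputed after every new observation, which is what makes monitoring the uncertainty as data accumulate straightforward.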
The FDA decision to approve Aducanumab has sparked controversy. Here, we argue that some of these controversies are rooted in the conclusions facilitated by classical frequentist analysis. We suggest that a Bayesian analysis would answer some of the burning questions the aducanumab trials have posed to the field. We applied Bayesian analysis of mod...
Separating confirmatory and exploratory analyses is vital for ensuring the credibility of research results. Here, we present a two-stage Bayesian sequential procedure that combines a maximum of exploratory freedom in the first stage with a strictly confirmatory regimen in the second stage. It allows for flexible sampling schemes and a statistically...
Progress in psychology has been frustrated by challenges concerning replicability, generalizability, strategy selection, inferential reproducibility, and computational reproducibility. Although often discussed separately, these five challenges may share a common cause: insufficient investment of intellectual and nonintellectual resources into the t...
The audit environment of today offers a wealth of information in the form of data. Consequently, data about the auditee is expected to guide and improve auditors’ approach to tests of details. However, to be able to make optimal use of this data, auditors must have tools that facilitate the effective and efficient use of quantitative information th...
Meta-analyses are essential for cumulative science, but their validity can be compromised by publication bias. To mitigate the impact of publication bias, one may apply publication-bias-adjustment techniques such as precision-effect test and precision-effect estimate with standard errors (PET-PEESE) and selection models. These methods, implemented...
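For orientation, PET and PEESE are weighted meta-regressions whose intercept serves as the bias-adjusted effect estimate: PET regresses effect sizes on standard errors, PEESE on sampling variances. Below is a hedged numpy-only sketch with synthetic effect sizes and standard errors.

```python
# PET and PEESE sketches as weighted least-squares meta-regressions,
# weighting studies by inverse variance. Effects and SEs are synthetic.
import numpy as np

rng = np.random.default_rng(4)
se = rng.uniform(0.05, 0.4, size=40)            # per-study standard errors
effect = 0.2 + 1.0 * se + rng.normal(scale=se)  # bias that grows with SE

w = 1.0 / se**2  # inverse-variance weights

def wls_intercept(predictor, y, w):
    """Weighted least squares; the intercept is the adjusted effect."""
    X = np.column_stack([np.ones_like(predictor), predictor])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return beta[0]

print("PET estimate:  ", wls_intercept(se, effect, w))
print("PEESE estimate:", wls_intercept(se**2, effect, w))
```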
Ly and Wagenmakers (Computational Brain & Behavior:1–8, in press) critiqued the Full Bayesian Significance Test (FBST) and the associated statistic FBST ev: similar to the frequentist p-value, FBST ev cannot quantify evidence for the null hypothesis, allows sampling to a foregone conclusion, and suffers from the Jeffreys-Lindley paradox. In respons...
The Jeffreys–Lindley paradox exposes a rift between Bayesian and frequentist hypothesis testing that strikes at the heart of statistical inference. Contrary to what most current literature suggests, the paradox was central to the Bayesian testing methodology developed by Sir Harold Jeffreys in the late 1930s. Jeffreys showed that the evidence for a...
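A standard way to sketch the paradox for a normal mean under a unit-information prior (an asymptotic form, not a formula quoted from this abstract):

```latex
% For a fixed test statistic z, the Bayes factor for the point null
% grows without bound as the sample size n increases,
\[
  \mathrm{BF}_{01} \;\approx\; \sqrt{n}\,\exp\!\left(-\tfrac{z^2}{2}\right)
  \;\xrightarrow{\;n \to \infty\;}\; \infty
  \quad \text{for fixed } z,
\]
% so a result that stays "significant" at any fixed p-value level
% eventually favors the null hypothesis.
```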
Publication selection bias undermines the systematic accumulation of evidence. To assess the extent of this problem, we survey over 26,000 meta-analyses containing more than 800,000 effect size estimates from medicine, economics, and psychology. Our results indicate that meta-analyses in economics are the most severely contaminated by publication s...
A tradition that goes back to Sir Karl R. Popper assesses the value of a statistical test primarily by its severity: was there an honest and stringent attempt to prove the tested hypothesis wrong? For "error statisticians" such as Mayo (1996, 2018), and frequentists more generally, severity is a key virtue in hypothesis tests. Conversely, failure t...
Theoretical arguments and empirical investigations indicate that a high proportion of published findings are false or do not replicate. The current position paper provides a broad perspective on this scientific error, focusing both on reform history and on opportunities for future reform. Talking points are organised along four main themes: methodo...
Publication bias is a ubiquitous threat to the validity of meta‐analysis and the accumulation of scientific evidence. In order to estimate and counteract the impact of publication bias, multiple methods have been developed; however, recent simulation studies have shown the methods' performance to depend on the true data generating process, and no m...
We present a novel and easy to use method for calibrating error-rate based confidence intervals to evidence-based support intervals. Support intervals are obtained from inverting Bayes factors based on the point estimate and standard error of a parameter estimate. A $k$ support interval can be interpreted as "the interval contains parameter values...
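A compact way to state the idea, assuming the Savage-Dickey style definition of the support interval (a paraphrase, not a quotation from the abstract):

```latex
% The k support interval collects the parameter values whose support
% from the data is at least a factor k, i.e., whose posterior density
% exceeds their prior density k-fold.
\[
  \mathrm{SI}_k \;=\;
  \left\{\, \theta_0 \;:\;
    \frac{p(\theta_0 \mid \text{data})}{p(\theta_0)} \;\geq\; k
  \,\right\}.
\]
```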
A perennial objection against Bayes factor point-null hypothesis tests is that the point-null hypothesis is known to be false from the outset. We examine the consequences of approximating the sharp point-null hypothesis by a hazy ‘peri-null’ hypothesis instantiated as a narrow prior distribution centered on the point of interest. The peri-null Baye...
Uncertainty is ubiquitous in science, but scientific knowledge is often represented to the public and in educational contexts as certain and immutable. This contrast can foster distrust when scientific knowledge develops in a way that people perceive as reversals, as we have observed during the ongoing COVID-19 pandemic. Drawing on research in st...
In this document, we outline the dataset that was used in the Many-Analysts Religion Project (MARP). Specifically, we provide details on how participants were recruited and what materials were used. The dataset itself is openly available at https://osf.io/k9puq/. If you want to use the data, please cite this document.
Null hypothesis statistical significance testing (NHST) is the dominant approach for evaluating results from randomized controlled trials. Whereas NHST comes with long-run error rate guarantees, its main inferential tool -- the $p$-value -- is only an indirect measure of evidence against the null hypothesis. The main reason is that the $p$-value is...
Analysis of variance (ANOVA) is widely used to assess the influence of one or more (quasi-)experimental manipulations on a continuous outcome. Traditionally, ANOVA is carried out in a frequentist manner using p-values, but a Bayesian alternative has been proposed. It seems reasonable to assume that this Bayesian ANOVA would be a direct translation...