
# The JASP Guidelines for Conducting and Reporting a Bayesian Analysis


Johnny van Doorn1, Don van den Bergh1, Udo Böhm1, Fabian
Dablander1, Koen Derks2, Tim Draws1, Alexander Etz3, Nathan J.
Evans1, Quentin F. Gronau1, Julia M. Haaf1, Max Hinne1, Šimon
Kucharský1, Alexander Ly1,4, Maarten Marsman1, Dora Matzke1,
Akash R. Komarlu Narendra Gupta1, Alexandra Sarafoglou1,
Angelika Stefan1, Jan G. Voelkel5, and Eric-Jan Wagenmakers1
1University of Amsterdam
3University of California, Irvine
4Centrum Wiskunde & Informatica
5Stanford University
We thank Dr. Simons, two anonymous reviewers, and the editor for comments on an
earlier draft. Correspondence concerning this article may be addressed to Johnny van Doorn,
University of Amsterdam, Department of Psychological Methods, Valckeniersstraat 59, 1018
XA Amsterdam, the Netherlands. E-mail may be sent to JohnnyDoorn@gmail.com. This
work was supported in part by a Vici grant from the Netherlands Organisation for Scientific
Research (NWO) awarded to EJW (016.Vici.170.083). DM is supported by a Veni grant
(451-15-010) from the NWO. MM is supported by a Veni grant (451-17-017) from the NWO. AE is
supported by a National Science Foundation Graduate Research Fellowship (DGE1321846).
Centrum Wiskunde & Informatica (CWI) is the national research institute for mathematics
and computer science in the Netherlands.
Abstract
Despite the increasing popularity of Bayesian inference in empirical
research, few practical guidelines provide detailed recommendations for
how to apply Bayesian procedures and interpret the results. Here we oﬀer
speciﬁc guidelines for four diﬀerent stages of Bayesian statistical reason-
ing in a research setting: planning the analysis, executing the analysis,
interpreting the results, and reporting the results. The guidelines for each
stage are illustrated with a running example. Although the guidelines are
geared toward analyses performed with the open-source statistical soft-
ware JASP, most guidelines extend to Bayesian inference in general.
Keywords: Bayesian inference, scientiﬁc reporting, statistical software.
In recent years Bayesian inference has become increasingly popular, both in
statistical science and in applied ﬁelds such as psychology, biology, and econo-
metrics (e.g., Vandekerckhove et al., 2018; Andrews & Baguley, 2013). For the
pragmatic researcher, the adoption of the Bayesian framework brings several ad-
vantages over the standard framework of frequentist null-hypothesis signiﬁcance
testing (NHST), including (1) the ability to obtain evidence in favor of the null
hypothesis and discriminate between “absence of evidence” and “evidence of ab-
sence” (Dienes, 2014; Keysers et al., 2020); (2) the ability to take into account
prior knowledge to construct a more informative test (Lee & Vanpaemel, 2018;
Gronau et al., 2020); and (3) the ability to monitor the evidence as the data ac-
cumulate (Rouder, 2014). However, the relative novelty of conducting Bayesian
analyses in applied ﬁelds means that there are no detailed reporting standards,
and this in turn may frustrate the broader adoption and proper interpretation
of the Bayesian framework.
Several recent statistical guidelines include information on Bayesian infer-
ence, but these guidelines are either minimalist (The BaSiS group, 2001; Ap-
pelbaum et al., 2018), focus only on relatively complex statistical tests (Depaoli
& van de Schoot, 2017), are too speciﬁc to a certain ﬁeld (Spiegelhalter et al.,
2000; Sung et al., 2005), or do not cover the full inferential process (Jarosz &
Wiley, 2014). The current article aims to provide a general overview of the
diﬀerent stages of the Bayesian reasoning process in a research setting. Specif-
ically, we focus on guidelines for analyses conducted in JASP (JASP Team,
2019; jasp-stats.org), although these guidelines can be generalized to other
software packages for Bayesian inference. JASP is an open-source statistical
software program with a graphical user interface that features both Bayesian
and frequentist versions of common tools such as the t-test, the ANOVA, and
regression analysis (e.g., Marsman & Wagenmakers, 2017; Wagenmakers, Love,
et al., 2018).
We discuss four stages of analysis: planning, executing, interpreting, and re-
porting. These stages and their individual components are summarized in Table
1 at the end of the manuscript. In order to provide a concrete illustration of the
guidelines for each of the four stages, each section features a data set reported by
Frisby & Clatworthy (1975). This data set concerns the time it took two groups
of participants to see a ﬁgure hidden in a stereogram – one group received ad-
vance visual information about the scene (i.e., the VV condition), whereas the
other group did not (i.e., the NV condition).1 Three additional examples (mixed
ANOVA, correlation analysis, and a t-test with an informed prior) are provided
in an online appendix at https://osf.io/nw49j/. Throughout the paper, we
present three boxes that provide additional technical discussion. These boxes,
while not strictly necessary, may prove useful to readers interested in greater
detail.
1 The variables are participant number, the time (in seconds) each participant needed to
see the hidden ﬁgure (i.e., fuse time), experimental condition (VV = with visual information,
NV = without visual information), and the log-transformed fuse time.
Stage 1: Planning the Analysis
Specifying the goal of the analysis. We recommend that researchers care-
fully consider their goal, that is, the research question that they wish to answer,
prior to the study (Jeﬀreys, 1939). When the goal is to ascertain the presence
or absence of an eﬀect, we recommend a Bayes factor hypothesis test (see Box
1). The Bayes factor compares the predictive performance of two hypotheses.
This underscores an important point: in the Bayes factor testing framework,
hypotheses cannot be evaluated until they are embedded in fully speciﬁed mod-
els with a prior distribution and likelihood (i.e., in such a way that they make
quantitative predictions about the data). Thus, when we refer to the predictive
performance of a hypothesis, we implicitly refer to the accuracy of the predic-
tions made by the model that encompasses the hypothesis (Etz et al., 2018).
When the goal is to determine the size of the effect, under the assumption
that it is present, we recommend plotting the posterior distribution or summarizing
it by a credible interval (see Box 2). Testing and estimation are not mutually
exclusive and may be used in sequence; for instance, one may ﬁrst use a test
to ascertain that the eﬀect exists, and then continue to estimate the size of the
eﬀect.
Box 1. Hypothesis testing. The principled approach to Bayesian hy-
pothesis testing is by means of the Bayes factor (e.g., Wrinch & Jeﬀreys,
1921; Etz & Wagenmakers, 2017; Jeﬀreys, 1939; Ly et al., 2016). The Bayes
factor quantiﬁes the relative predictive performance of two rival hypotheses,
and it is the degree to which the data demand a change in beliefs concern-
ing the hypotheses’ relative plausibility (see Equation 1). Speciﬁcally, the
ﬁrst term in Equation 1 corresponds to the prior odds, that is, the rela-
tive plausibility of the rival hypotheses before seeing the data. The second
term, the Bayes factor, indicates the evidence provided by the data. The
third term, the posterior odds, indicates the relative plausibility of the rival
hypotheses after having seen the data.
$$
\underbrace{\frac{p(H_1)}{p(H_0)}}_{\text{Prior odds}} \times \underbrace{\frac{p(D \mid H_1)}{p(D \mid H_0)}}_{\text{Bayes factor } \mathrm{BF}_{10}} = \underbrace{\frac{p(H_1 \mid D)}{p(H_0 \mid D)}}_{\text{Posterior odds}} \qquad (1)
$$
The subscript in the Bayes factor notation indicates which hypothesis is
supported by the data. BF10 indicates the Bayes factor in favor of H1
over H0, whereas BF01 indicates the Bayes factor in favor of H0 over H1.
Specifically, BF10 = 1/BF01. Larger values of BF10 indicate more support for
H1. Bayes factors range from 0 to ∞, and a Bayes factor of 1 indicates that
both hypotheses predicted the data equally well. This principle is further
illustrated in Figure 4.
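As a concrete illustration of the middle term of Equation 1, the sketch below computes a Bayes factor analytically for the simplest possible setting: binomial data with a point null H0: θ = θ0 against an H1 that assigns θ a uniform prior. This is our own toy example, not a JASP analysis, and the function name is ours.

```python
import math

def bf10_binomial(k, n, theta0=0.5):
    """Bayes factor BF10 for k successes in n trials:
    H0: theta = theta0  versus  H1: theta ~ Uniform(0, 1)."""
    # Marginal likelihood under H0: the binomial probability at the point null.
    m0 = math.comb(n, k) * theta0**k * (1 - theta0)**(n - k)
    # Marginal likelihood under H1: the binomial likelihood integrated over
    # the uniform prior, which works out analytically to 1 / (n + 1).
    m1 = 1 / (n + 1)
    return m1 / m0

bf10 = bf10_binomial(k=8, n=10)  # data favor H1 (BF10 > 1)
bf01 = 1 / bf10                  # evidence for H0 is the reciprocal
```

For 8 successes in 10 trials, BF10 is roughly 2: the data are about twice as likely under H1 as under H0, which counts as weak evidence in the classification of Figure 4.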
Box 2. Parameter estimation. For Bayesian parameter estimation, in-
terest centers on the posterior distribution of the model parameters. The
posterior distribution reﬂects the relative plausibility of the parameter val-
ues after prior knowledge has been updated by means of the data. Speciﬁ-
cally, we start the estimation procedure by assigning the model parameters
a prior distribution that reﬂects the relative plausibility of each parameter
value before seeing the data. The information in the data is then used to
update the prior distribution to the posterior distribution. Parameter val-
ues that predicted the data relatively well receive a boost in plausibility,
whereas parameter values that predicted the data relatively poorly suﬀer
a decline (Wagenmakers et al., 2016). Equation 2 illustrates this principle.
The ﬁrst term indicates the prior beliefs about the values of parameter θ.
The second term is the updating factor: for each value of θ, the quality of
its prediction is compared to the average quality of the predictions over all
values of θ. The third term indicates the posterior beliefs about θ.
$$
\underbrace{p(\theta)}_{\text{Prior belief}} \times \frac{\overbrace{p(\text{data} \mid \theta)}^{\text{Predictive adequacy of specific } \theta}}{\underbrace{p(\text{data})}_{\text{Average predictive adequacy across all } \theta\text{'s}}} = \underbrace{p(\theta \mid \text{data})}_{\text{Posterior belief}}. \qquad (2)
$$
The posterior distribution can be plotted or summarized by an x% credible
interval. An x% credible interval contains x% of the posterior mass. Two
popular ways of creating a credible interval are the highest density credible
interval, which is the narrowest interval containing the specified mass, and
the central credible interval, which is created by cutting off (100 − x)/2 %
from each of the tails of the posterior distribution.
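The updating cycle of Equation 2 and the central credible interval can be sketched numerically. The example below is our own illustration, not JASP output: it places a binomial rate θ on a discrete grid, multiplies prior by likelihood, normalizes, and then cuts 2.5% of the mass from each tail to form a 95% central interval.

```python
import math

def posterior_on_grid(k, n, a=1.0, b=1.0, m=10_000):
    """Grid approximation of the posterior for a binomial rate theta,
    with a Beta(a, b) prior: prior x likelihood, then normalize."""
    grid = [(i + 0.5) / m for i in range(m)]
    # log prior + log likelihood, computed in log space for stability
    logpost = [(a - 1 + k) * math.log(t) + (b - 1 + n - k) * math.log(1 - t)
               for t in grid]
    mx = max(logpost)
    w = [math.exp(lp - mx) for lp in logpost]
    z = sum(w)
    return grid, [wi / z for wi in w]

def central_interval(grid, probs, mass=0.95):
    """Central credible interval: cut (1 - mass)/2 of the mass from each tail."""
    tail = (1 - mass) / 2
    cum, lo, hi = 0.0, grid[0], grid[-1]
    for t, p in zip(grid, probs):
        if cum <= tail:
            lo = t          # last point with at most 2.5% mass below it
        if cum <= 1 - tail:
            hi = t          # last point with at most 97.5% mass below it
        cum += p
    return lo, hi

grid, probs = posterior_on_grid(k=8, n=10)   # posterior is Beta(9, 3)
lo, hi = central_interval(grid, probs)
```

With a uniform Beta(1, 1) prior and 8 successes in 10 trials, the posterior mean lands at 9/12 = 0.75, and the interval endpoints approximate the 2.5% and 97.5% quantiles of a Beta(9, 3) distribution.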
Specifying the statistical model. The functional form of the model (i.e.,
the likelihood; Etz, 2018) is guided by the nature of the data and the research
question. For instance, if interest centers on the association between two vari-
ables, one may specify a bivariate normal model in order to conduct inference
on Pearson’s correlation parameter ρ. The statistical model also determines
which assumptions ought to be satisﬁed by the data. For instance, the statis-
tical model might assume the dependent variable to be normally distributed.
Violations of assumptions may be addressed at diﬀerent points in the analysis,
such as the data preprocessing steps discussed below, or by planning to conduct
robust inferential procedures as a contingency plan.
The next step in model speciﬁcation is to determine the sidedness of the
procedure. For hypothesis testing, this means deciding whether the procedure
is one-sided (i.e., the alternative hypothesis dictates a speciﬁc direction of the
population eﬀect) or two-sided (i.e., the alternative hypothesis dictates that
the eﬀect can be either positive or negative). The choice of one-sided versus
two-sided depends on the research question at hand and this choice should be
theoretically justiﬁed prior to the study. For hypothesis testing it is usually
the case that the alternative hypothesis posits a speciﬁc direction. In Bayesian
hypothesis testing, a one-sided hypothesis yields a more diagnostic test than a
two-sided alternative (e.g., Wetzels et al., 2009; Jeffreys, 1961, p. 283).2
For parameter estimation, we recommend always using the two-sided model
instead of the one-sided model: when a positive one-sided model is specified
but the observed effect turns out to be negative, all of the posterior mass will
nevertheless remain on the positive values, falsely suggesting the presence of a
small positive effect.
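This pitfall can be demonstrated with a small numerical sketch (our own, with a normal prior and likelihood standing in for the t-test machinery): truncating the prior at zero forces the posterior mass onto positive values even when the observed effect is clearly negative.

```python
import math

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def posterior_mean(obs, se, prior, lo=-4.0, hi=4.0, m=8000):
    """Posterior mean of the effect size delta on a grid, for a normal
    likelihood centered on the observed effect."""
    grid = [lo + (hi - lo) * (i + 0.5) / m for i in range(m)]
    w = [prior(d) * normal_pdf(obs, d, se) for d in grid]
    z = sum(w)
    return sum(d * wi for d, wi in zip(grid, w)) / z

two_sided = lambda d: normal_pdf(d, 0.0, 1.0)           # prior over all of R
one_sided = lambda d: two_sided(d) if d > 0 else 0.0    # truncated at zero

obs, se = -0.4, 0.2   # the observed effect is clearly negative
m2 = posterior_mean(obs, se, two_sided)   # negative: tracks the data
m1 = posterior_mean(obs, se, one_sided)   # positive despite the negative data
```

The two-sided posterior mean follows the data; the truncated one-sided posterior nevertheless reports a small positive effect, which is exactly the misleading behavior warned against above.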
The next step in model speciﬁcation concerns the type and spread of the
prior distribution, including its justiﬁcation. For the most common statistical
models (e.g., correlations, t-tests, and ANOVA), certain “default” prior distri-
butions are available that can be used in cases where prior knowledge is absent,
vague, or diﬃcult to elicit (for more information, see Ly et al., 2016). These
priors are default options in JASP. In cases where prior information is present,
diﬀerent “informed” prior distributions may be speciﬁed. However, the more
the informed priors deviate from the default priors, the stronger becomes the
need for a justiﬁcation (see the informed t-test example in the online appendix
at https://osf.io/ybszx/). Additionally, the robustness of the result to dif-
ferent prior distributions can be explored and included in the report. This is an
2 A one-sided alternative hypothesis makes a riskier prediction than a two-sided hy-
pothesis. Consequently, if the data are in line with the one-sided prediction, the one-sided
alternative hypothesis is rewarded with a greater gain in plausibility compared to the two-sided
alternative hypothesis; if the data oppose the one-sided prediction, the one-sided alternative
hypothesis is penalized with a greater loss in plausibility compared to the two-sided alternative
hypothesis.
important type of robustness check because the choice of prior can sometimes
impact our inferences, such as in experiments with small sample sizes or missing
data. In JASP, Bayes factor robustness plots show the Bayes factor for a wide
range of prior distributions, allowing researchers to quickly examine the extent
to which their conclusions depend on their prior speciﬁcation. An example of
such a plot is given later in Figure 7.
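In the same spirit as JASP's robustness plot, one can recompute a Bayes factor across a range of prior widths. The sketch below is our own binomial analogue, not JASP's t-test: it varies the concentration a of a symmetric Beta(a, a) prior on a binomial rate, where small a gives a wide prior and large a concentrates mass near the null value of 0.5.

```python
import math

def log_beta(a, b):
    """Log of the Beta function, via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def bf10(k, n, a, theta0=0.5):
    """BF10 for H1: theta ~ Beta(a, a) versus H0: theta = theta0,
    via the beta-binomial marginal likelihood."""
    log_m1 = (math.log(math.comb(n, k))
              + log_beta(k + a, n - k + a) - log_beta(a, a))
    log_m0 = (math.log(math.comb(n, k))
              + k * math.log(theta0) + (n - k) * math.log(1 - theta0))
    return math.exp(log_m1 - log_m0)

# Robustness check: how does the conclusion vary with the prior width?
for a in (0.5, 1, 2, 5, 10):
    print(f"a = {a:>4}: BF10 = {bf10(8, 10, a):.3f}")
```

If the Bayes factor stays in the same evidential category across this sweep, the conclusion is robust to the prior specification; if it swings across category boundaries, the report should say so.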
Specifying data preprocessing steps. Dependent on the goal of the
analysis and the statistical model, diﬀerent data preprocessing steps might be
taken. For instance, if the statistical model assumes normally distributed data,
a transformation to normality (e.g., the logarithmic transformation) might be
considered (e.g., Draper & Cox, 1969). Other points to consider at this stage
are when and how outliers may be identiﬁed and accounted for, which variables
are to be analyzed, and whether further transformation or combination of data
are necessary. These decisions can be somewhat arbitrary, and yet may exert
a large inﬂuence on the results (Wicherts et al., 2016). In order to assess the
degree to which the conclusions are robust to arbitrary modeling decisions, it
is advisable to conduct a multiverse analysis (Steegen et al., 2016). Preferably,
the multiverse analysis is speciﬁed at study onset. A multiverse analysis can
easily be conducted in JASP, but doing so is not the goal of the current paper.
Specifying the sampling plan. As may be expected from a framework
for the continual updating of knowledge, Bayesian inference allows researchers
to monitor evidence as the data come in, and stop whenever they like, for any
reason whatsoever. Thus, strictly speaking there is no Bayesian need to pre-
specify sample size at all (e.g., Berger & Wolpert, 1988). Nevertheless, Bayesians
are free to specify a sampling plan if they so desire; for instance, one may commit
to stop data collection as soon as BF10 ≥ 10 or BF01 ≥ 10. This approach can
also be combined with a maximum sample size (N), where data collection stops
when either the maximum N or the desired Bayes factor is obtained, whichever
comes first (for examples see Matzke et al., 2015; Wagenmakers et al., 2015).
In order to examine what sampling plans are feasible, researchers can con-
duct a Bayes factor design analysis (Schönbrodt & Wagenmakers, 2018; Stefan
et al., 2019), a method that shows the predicted outcomes for diﬀerent designs
and sampling plans. Of course, when the study is observational and the data
are available ‘en bloc’, the sampling plan becomes irrelevant in the planning
stage.
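A sampling plan of this kind can be sketched in a few lines. The simulation below is our own illustration, using a simple binomial Bayes factor with a uniform prior rather than a t-test: it collects one observation at a time and stops once BF10 ≥ 10, BF01 ≥ 10, or a maximum sample size is reached.

```python
import math, random

def bf10(k, n, theta0=0.5):
    """BF10 for H1: theta ~ Uniform(0, 1) versus H0: theta = theta0."""
    m1 = 1 / (n + 1)
    m0 = math.comb(n, k) * theta0**k * (1 - theta0)**(n - k)
    return m1 / m0

random.seed(1)
true_theta = 0.8              # simulated truth: an effect is present
threshold, max_n = 10, 200    # stop at BF10 >= 10, BF01 >= 10, or n = 200
k = n = 0
while n < max_n:
    k += random.random() < true_theta   # collect one new observation
    n += 1
    bf = bf10(k, n)
    if bf >= threshold or bf <= 1 / threshold:
        break
print(f"stopped at n = {n} with BF10 = {bf:.2f}")
```

Repeating such simulations over many seeds and design choices is exactly what a Bayes factor design analysis does: it shows the distribution of stopping points and final Bayes factors before any real data are collected.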
Stereogram Example
First, we consider the research goal, which was to determine if participants
who receive advance visual information exhibit a shorter fuse time (Frisby &
Clatworthy, 1975). A Bayes factor hypothesis test can be used to quantify the
evidence that the data provide for and against the hypothesis that an eﬀect is
present. Should this test reveal support in favor of the presence of the eﬀect,
then we have grounds for a follow-up analysis in which the size of the eﬀect is
estimated.
Second, we specify the statistical model. The study focus is on the dif-
ference in performance between two between-subjects conditions, suggesting a
two-sample t-test on the fuse times is appropriate. The main measure of the
study is a reaction time variable, which can for various reasons be non-normally
distributed (Lo & Andrews, 2015; but see Schramm & Rouder, 2019). If our
data show signs of non-normality we will conduct two alternatives: a t-test
on the log-transformed fuse time data and a non-parametric t-test (i.e., the
Mann-Whitney U test), which is robust to non-normality and unaﬀected by the
log-transformation of the fuse times.
For hypothesis testing, we compare the null hypothesis (i.e., advance visual
information has no eﬀect on fuse times) to a one-sided alternative hypothesis
(i.e., advance visual information shortens the fuse times), in line with the di-
rectional nature of the original research question. The rival hypotheses are thus
H0: δ = 0 and H+: δ > 0, where δ is the standardized effect size (i.e., the popula-
tion version of Cohen's d), H0 denotes the null hypothesis, and H+ denotes the
one-sided alternative hypothesis (note the '+' in the subscript). For parameter
estimation (under the assumption that the eﬀect exists) we use the two-sided
t-test model and plot the posterior distribution of δ. This distribution can also
be summarized by a 95% central credible interval.
We complete the model speciﬁcation by assigning prior distributions to the
model parameters. Since we have only little prior knowledge about the topic,
we select a default prior option for the two-sample t-test, that is, a Cauchy
distribution3 with spread r set to 1/√2. Since we specified a one-sided alter-
native hypothesis, the prior distribution is truncated at zero, such that only
positive eﬀect size values are allowed. The robustness of the Bayes factor to
this prior speciﬁcation can be easily assessed in JASP by means of a Bayes
factor robustness plot.
Since the data are already available, we do not have to specify a sampling
plan. The original data set has a total sample size of 103, from which 25
participants were eliminated due to failing an initial stereo-acuity test, leaving
78 participants (43 in the NV condition and 35 in the VV condition). The data
are available online at https://osf.io/5vjyt/.
3 The fat-tailed Cauchy distribution is a popular default choice because it fulfills particular
desiderata; see Jeffreys, 1961; Liang et al., 2008; Ly et al., 2016; Rouder et al., 2009 for details.
Stage 2: Executing the Analysis
Before executing the primary analysis and interpreting the outcome, it is im-
portant to conﬁrm that the intended analyses are appropriate and the models
are not grossly misspeciﬁed for the data at hand. In other words, it is strongly
recommended to examine the validity of the model assumptions (e.g., normally
distributed residuals or equal variances across groups). Such assumptions may
be checked by plotting the data, inspecting summary statistics, or conducting
formal assumption tests (but see Tijmstra, 2018).
A powerful demonstration of the dangers of failing to check the assumptions
is provided by Anscombe's quartet (Anscombe, 1973; see Figure 1). The quar-
tet consists of four ﬁctitious data sets of equal size that each have the same
observed Pearson’s product moment correlation r, and therefore lead to the
same inferential result both in a frequentist and a Bayesian framework. How-
ever, visual inspection of the scatterplots immediately reveals that three of the
four data sets are not suitable for a linear correlation analysis, and the sta-
tistical inference for these three data sets is meaningless or even misleading.
This example highlights the adage that conducting a Bayesian analysis does not
safeguard against general statistical malpractice – the Bayesian framework is as
vulnerable to violations of assumptions as its frequentist counterpart. In cases
where assumptions are violated, an ordinal or non-parametric test can be used,
and the parametric results should be interpreted with caution.
Once the quality of the data has been conﬁrmed, the planned analyses can
be carried out. JASP oﬀers a graphical user interface for both frequentist and
Bayesian analyses. JASP 0.10.2 features the following Bayesian analyses: the
binomial test, the chi-square test, the multinomial test, the t-test (one-sample,
paired sample, two-sample, Wilcoxon rank sum, and Wilcoxon signed-rank
tests), A/B tests, ANOVA, ANCOVA, repeated measures ANOVA, correlations
(Pearson's ρ and Kendall's τ), linear regression, and log-linear regression. After
loading the data into JASP, the desired analysis can be conducted by dragging
and dropping variables into the appropriate boxes; tick marks can be used to
select the desired output.
The resulting output (i.e., ﬁgures and tables) can be annotated and saved
as a .jasp ﬁle. Output can then be shared with peers, with or without the real
data in the .jasp ﬁle; if the real data are added, reviewers can easily reproduce
the analyses, conduct alternative analyses, or insert comments.
Stereogram Example
In order to check for violations of the assumptions of the t-test, the top row of
Figure 2 shows boxplots and Q-Q plots of the dependent variable fuse time,
split by condition. Visual inspection of the boxplots suggests that the variances
of the fuse times may not be equal (observed standard deviations of the NV
and VV groups are 8.085 and 4.802, respectively), suggesting the equal vari-
ance assumption may be unlikely to hold. There also appear to be a number
of potential outliers in both groups. Moreover, the Q-Q plots show that the
normality assumption of the t-test is untenable here. Thus, in line with our
analysis plan we will apply the log-transformation to the fuse times. The
standard deviations of the log-transformed fuse times in the groups are roughly
equal (observed standard deviations are 0.814 and 0.818 in the NV and the VV
group, respectively); the Q-Q plots in the bottom row of Figure 2 also look ac-
ceptable for both groups and there are no apparent outliers. However, it seems
prudent to assess the robustness of the result by also conducting the Bayesian
Mann-Whitney U test (van Doorn et al., 2019) on the fuse times.
Following the assumption check we proceed to execute the analyses in JASP.
For hypothesis testing, we obtain a Bayes factor using the one-sided Bayesian
Figure 1: Model misspeciﬁcation is also a problem for Bayesian analyses. The
four scatterplots on top show Anscombe’s quartet (Anscombe, 1973); the bottom
panel shows the corresponding inference, which is identical for all four scatter
plots. Except for the leftmost scatterplot, all data violate the assumptions of
the linear correlation analysis in important ways.
[Figure 2 panels: (a) boxplots of raw fuse times split by condition; (b) Q-Q plot of the raw fuse times for the NV condition; (c) Q-Q plot of the raw fuse times for the VV condition; (d) boxplots of log fuse times split by condition; (e) Q-Q plot of the log fuse times for the NV condition; (f) Q-Q plot of the log fuse times for the VV condition.]
Figure 2: Descriptive plots allow a visual assessment of the assumptions of the
t-test for the stereogram data. The top row shows descriptive plots for the raw
fuse times, and the bottom row shows descriptive plots for the log-transformed
fuse times. The left column shows boxplots, including the jittered data points,
for each of the experimental conditions. The middle and right columns show
Q-Q plots of the dependent variable, split by experimental condition. Here
we see that the log-transformed dependent variable is more appropriate for the
t-test, due to its distribution and absence of outliers. Figures from JASP.
two-sample t-test. Figure 3 shows the JASP user interface for this procedure.
For parameter estimation, we obtain a posterior distribution and credible inter-
val, using the two-sided Bayesian two-sample t-test. The relevant boxes for the
various plots were ticked, and an annotated .jasp ﬁle was created with all of the
relevant analyses: the one-sided Bayes factor hypothesis tests, the robustness
check, the posterior distribution from the two-sided analysis, and the one-sided
results of the Bayesian Mann-Whitney U test. The .jasp ﬁle can be found at
https://osf.io/nw49j/. The next section outlines how these results are to be
interpreted.
Figure 3: JASP menu for the Bayesian two-sample t-test. The left input panel
oﬀers the analysis options, including the speciﬁcation of the alternative hypoth-
esis and the selection of plots. The right output panel shows the corresponding
analysis output. The prior and posterior plot is explained in more detail in Fig-
ure 6b. The input panel speciﬁes the one-sided analysis for hypothesis testing;
a two-sided analysis for estimation can be obtained by selecting "Group 1 ≠
Group 2" under "Alt. Hypothesis".
Stage 3: Interpreting the Results
With the analysis outcome in hand we are ready to draw conclusions. We
ﬁrst discuss the scenario of hypothesis testing, where the goal typically is to
conclude whether an eﬀect is present or absent. Then, we discuss the scenario
of parameter estimation, where the goal is to estimate the size of the population
eﬀect, assuming it is present. When both hypothesis testing and estimation
procedures have been planned and executed, there is no predetermined order
for their interpretation. One may adhere to the adage “only estimate something
when there is something to be estimated” (Wagenmakers, Marsman, et al., 2018)
and ﬁrst test whether an eﬀect is present, and then estimate its size (assuming
the test provided suﬃciently strong evidence against the null), or one may ﬁrst
estimate the magnitude of an eﬀect, and then quantify the degree to which
this magnitude warrants a shift in plausibility away from or toward the null
hypothesis (but see Box 3).
If the goal of the analysis is hypothesis testing, we recommend using the
Bayes factor. As described in Box 1, the Bayes factor quantiﬁes the relative
predictive performance of two rival hypotheses (Wagenmakers et al., 2016; see
Box 1). Importantly, the Bayes factor is a relative metric of the hypotheses’
predictive quality. For instance, if BF10 = 5, this means that the data are 5
times more likely under H1 than under H0. However, a Bayes factor in favor of
H1 does not mean that H1 predicts the data well. As Figure 1 illustrates, H1
provides a dreadful account of three out of four data sets, yet is still supported
relative to H0.
There can be no hard Bayes factor bound (other than zero and inﬁnity)
for accepting or rejecting a hypothesis wholesale, but there have been some
attempts to classify the strength of evidence that diﬀerent Bayes factors provide
(e.g., Jeﬀreys, 1939; Kass & Raftery, 1995). One such classiﬁcation scheme is
Figure 4: A graphical representation of a Bayes factor classiﬁcation table. As
the Bayes factor deviates from 1, which indicates equal support for H0and
H1, more support is gained for either H0or H1. Bayes factors between 1 and
3 are considered to be weak, Bayes factors between 3 and 10 are considered
moderate, and Bayes factors greater than 10 are considered strong evidence.
The Bayes factors are also represented as probability wheels, where the ratio of
white (i.e., support for H0) to red (i.e., support for H1) surface is a function
of the Bayes factor. The probability wheels further underscore the continuous
scale of evidence that Bayes factors represent. These classiﬁcations are heuristic
and should not be misused as an absolute rule for all-or-nothing conclusions.
shown in Figure 4. Several magnitudes of the Bayes factor are visualized as
a probability wheel, where the proportion of red to white is determined by
the degree of evidence in favor of H0 and H1.4 In line with Jeffreys, a Bayes
factor between 1 and 3 is considered weak evidence, a Bayes factor between 3
and 10 is considered moderate evidence, and a Bayes factor greater than 10
is considered strong evidence. Note that these classiﬁcations should only be
used as general rules of thumb to facilitate communication and interpretation
of evidential strength. Indeed, one of the merits of the Bayes factor is that it
oﬀers an assessment of evidence on a continuous scale.
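The rules of thumb in Figure 4 are easy to encode. The helper below (our own naming) maps a Bayes factor onto the Jeffreys-style verbal labels, symmetrically for evidence favoring H0 or H1; like the figure, it is a communication aid, not a decision rule.

```python
def evidence_category(bf10):
    """Heuristic verbal label for a Bayes factor (after Jeffreys, 1939).
    Values below 1 favor H0, so they are mapped via the reciprocal."""
    if bf10 == 1:
        return "no evidence either way"
    bf, hypothesis = (bf10, "H1") if bf10 > 1 else (1 / bf10, "H0")
    if bf < 3:
        strength = "weak"
    elif bf < 10:
        strength = "moderate"
    else:
        strength = "strong"
    return f"{strength} evidence for {hypothesis}"

evidence_category(5.0)    # → 'moderate evidence for H1'
evidence_category(0.05)   # → 'strong evidence for H0'
```

Any report should quote the Bayes factor itself alongside such a label, since the label discards the continuous scale of evidence.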
When the goal of the analysis is parameter estimation, the posterior distri-
bution is key (see Box 2). The posterior distribution is often summarized by
a location parameter (point estimate) and uncertainty measure (interval esti-
mate). For point estimation, the posterior median (reported by JASP), mean,
4 Specifically, the proportion of red is the posterior probability of H1 under a prior probability of 0.5; for a more detailed explanation and a cartoon see https://tinyurl.com/ydhfndxa
or mode can be reported, although these do not contain any information about
the uncertainty of the estimate. In order to capture the uncertainty of the es-
timate, an x% credible interval can be reported. The credible interval [L, U]
has an x% probability that the true parameter lies in the interval that ranges
from L to U (an interpretation that is often wrongly attributed to frequentist
confidence intervals; see Morey et al., 2016). For example, if we obtain a 95%
credible interval of [−1, 0.5] for effect size δ, we can be 95% certain that the
true value of δ lies between −1 and 0.5, assuming that the alternative hypothesis
we specify is true. In case one does not want to make this assumption, one can
present the unconditional posterior distribution instead. For more discussion
on this point, see Box 3.
Box 3. Conditional vs. Unconditional Inference. A widely accepted
view on statistical inference is neatly summarized by Fisher (1925), who
states that “it is a useful preliminary before making a statistical estimate
... to test if there is anything to justify estimation at all” (p. 300; see also
Haaf et al., 2019). In the Bayesian framework, this stance naturally leads
to posterior distributions conditional on H1, which ignore the possibility
that the null value could be true. Generally, when we say “prior
distribution” or “posterior distribution” we are following convention and
referring to such conditional distributions. However, only presenting
conditional posterior distributions can potentially be misleading in cases
where the null hypothesis remains relatively plausible after seeing the data.
A general benefit of Bayesian analysis is that one can compute an
unconditional posterior distribution for the parameter using model
averaging (e.g., Hinne et al., 2020; Clyde et al., 2011). An unconditional
posterior distribution for a parameter accounts for both the uncertainty
about the parameter within any one model and the uncertainty about the
model itself, providing an estimate of the parameter that is a compromise
between the candidate models (for more details see Hoeting et al., 1999).
In the case of a t-test, which features only the null and the alternative
hypothesis, the unconditional posterior consists of a mixture between a
spike under H0 and a bell-shaped posterior distribution under H1 (Rouder
et al., 2018; van den Bergh et al., 2019). Figure 5 illustrates this approach
for the stereogram example.
[Figure 5 shows two panels plotting density against δ over the range −2 to 2.
The left panel is labeled p(H0) = 0.5, p(H1) = 0.5; the right panel, after
observing data D, is labeled p(H0|D) = 0.3, p(H1|D) = 0.7.]

Figure 5: Updating the unconditional prior distribution to the unconditional
posterior distribution for the stereogram example. The left panel shows the
unconditional prior distribution, which is a mixture between the prior
distributions under H0 and H1. The prior distribution under H0 is a spike at
the null value, indicated by the dotted line; the prior distribution under H1
is a Cauchy distribution, indicated by the gray mass. The mixture proportion
is determined by the prior model probabilities p(H0) and p(H1). The right
panel shows the unconditional posterior distribution, after updating the prior
distribution with the data D. This distribution is a mixture between the
posterior distributions under H0 and H1, where the mixture proportion is
determined by the posterior model probabilities p(H0|D) and p(H1|D). Since
p(H1|D) = 0.7 (i.e., the data provide support for H1 over H0), about 70% of
the unconditional posterior mass is comprised of the posterior mass under H1,
indicated by the gray mass. Thus, the unconditional posterior distribution
provides information about plausible values for δ, while taking into account
the uncertainty of H1 being true. In both panels, the dotted line and gray
mass have been rescaled such that the height of the dotted line and the
highest point of the gray mass reflect the prior (left) and posterior (right)
model probabilities.
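The mixture in Figure 5 is straightforward to compute oneself. The Python sketch below reuses the posterior model probabilities from the stereogram example but assumes, purely for illustration, a normal approximation to the conditional posterior under H1; it evaluates the unconditional (model-averaged) posterior as a spike at zero plus a weighted continuous density.

```python
import math

# Posterior model probabilities taken from the stereogram example (Figure 5).
p_h0, p_h1 = 0.3, 0.7

# Hypothetical normal approximation to the conditional posterior of delta
# under H1; the actual conditional posterior is not exactly normal.
post_mean, post_sd = 0.47, 0.22

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def unconditional_posterior(delta):
    """Model-averaged posterior at delta: a point mass at delta = 0 with
    weight p(H0|D), plus the conditional posterior density weighted by
    p(H1|D). Returned as (spike_mass, continuous_density), because the
    point mass has no density."""
    spike = p_h0 if delta == 0.0 else 0.0
    return spike, p_h1 * normal_pdf(delta, post_mean, post_sd)

# The probability that delta exceeds zero comes entirely from the
# continuous component; approximate it by a Riemann sum on a grid.
step = 0.001
grid = [i * step for i in range(1, 3001)]
mass_above_zero = sum(unconditional_posterior(d)[1] for d in grid) * step
print(f"unconditional P(delta > 0 | D) = {mass_above_zero:.3f}")
```

With p(H0|D) = 0.3, roughly 30% of the posterior mass sits exactly at δ = 0, which is why the unconditional distribution tempers conclusions relative to the conditional posterior.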
Common Pitfalls in Interpreting Bayesian Results
Bayesian veterans sometimes argue that Bayesian concepts are intuitive and
easier to grasp than frequentist concepts. However, in our experience there
exist persistent misinterpretations of Bayesian results. Here we list ﬁve:
The Bayes factor does not equal the posterior odds; in fact, the posterior
odds are equal to the Bayes factor multiplied by the prior odds (see also
Equation 1). These prior odds reﬂect the relative plausibility of the rival
hypotheses before seeing the data (e.g., 50/50 when both hypotheses are
equally plausible, or 80/20 when one hypothesis is deemed to be 4 times
more plausible than the other). For instance, a proponent and a skeptic
may diﬀer greatly in their assessment of the prior plausibility of a hypoth-
esis; their prior odds diﬀer, and, consequently, so will their posterior odds.
However, as the Bayes factor is the updating factor from prior odds to
posterior odds, proponent and skeptic ought to change their beliefs to the
same degree (assuming they agree on the model speciﬁcation, including
the parameter prior distributions).
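This updating rule is easy to verify numerically. The sketch below takes the BF+0 = 4.567 from the running stereogram example and shows how hypothetical proponent and skeptic prior odds lead to different posterior odds after the same Bayes factor update.

```python
# Posterior odds = Bayes factor x prior odds (Equation 1).
# The Bayes factor is taken from the running stereogram example; the
# proponent's and skeptic's prior odds are hypothetical.
bayes_factor = 4.567            # BF+0: evidence for H+ over H0

for label, prior_odds in [("proponent (50/50)", 1.0), ("skeptic (20/80)", 0.25)]:
    posterior_odds = bayes_factor * prior_odds
    # Convert posterior odds to a posterior model probability for H+.
    p_h_plus = posterior_odds / (1 + posterior_odds)
    print(f"{label}: posterior odds = {posterior_odds:.3f}, p(H+|D) = {p_h_plus:.3f}")
```

Both parties apply the same multiplicative update; only their starting points, and hence their posterior odds, differ.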
Prior model probabilities (i.e., prior odds) and parameter prior distributions
play different conceptual roles. The former concern prior beliefs about the
hypotheses, for instance that both H0 and H1 are equally plausible a priori.
The latter concern prior beliefs about the model parameters within a model,
for instance that all values of Pearson’s ρ are equally likely a priori (i.e., a
uniform prior distribution on the correlation parameter). Prior model
probabilities and parameter prior distributions can be combined into one
unconditional prior distribution as described in Box 3 and Figure 5.
[Footnote 5: This confusion does not arise for the rarely reported
unconditional distributions (see Box 3).]

The Bayes factor and credible interval have different purposes and can yield
different conclusions. Specifically, the typical credible interval for an effect
size is conditional on H1 being true and quantifies the strength of an effect,
assuming it is present (but see Box 3); in contrast, the Bayes factor
quantifies evidence for the presence or absence of an effect. A common
misconception is to conduct a “hypothesis test” by inspecting only credi-
ble intervals. Berger (2006, p. 383) remarks: “[...] Bayesians cannot test
precise hypotheses using conﬁdence intervals. In classical statistics one
frequently sees testing done by forming a conﬁdence region for the param-
eter, and then rejecting a null value of the parameter if it does not lie in the
conﬁdence region. This is simply wrong if done in a Bayesian formulation
(and if the null value of the parameter is believable as a hypothesis).”
The strength of evidence in the data is easy to overstate: a Bayes factor
of 3 provides some support for one hypothesis over another, but should
not warrant the conﬁdent all-or-none acceptance of that hypothesis.
The results of an analysis always depend on the questions that were asked.6
For instance, choosing a one-sided analysis over a two-sided analysis will
impact both the Bayes factor and the posterior distribution. For an
illustration of this point, see Figure 6 for a comparison between one-sided
and two-sided results.
In order to avoid these and other pitfalls, we recommend that re-
searchers who are doubtful about the correct interpretation of their Bayesian
results solicit expert advice (for instance through the JASP forum at
http://forum.cogsci.nl).
[Footnote 6: This is known as Jeffreys’s platitude: “The most beneficial
result that I can hope for as a consequence of this work is that more
attention will be paid to the precise statement of the alternatives involved
in the questions asked. It is sometimes considered a paradox that the answer
depends not only on the observations but on the question; it should be a
platitude” (Jeffreys, 1939, p. vi).]
Stereogram Example
For hypothesis testing, the results of the one-sided t-test are presented in
Figure 6a. The resulting BF+0 is 4.567, indicating moderate evidence in favor
of H+: the data are approximately 4.6 times more likely under H+ than under
H0. To assess the robustness of this result, we also planned a Mann-Whitney
H0. To assess the robustness of this result, we also planned a Mann-Whitney
U test. The resulting BF+0 is 5.191, qualitatively similar to the Bayes factor
from the parametric test. Additionally, we could have speciﬁed a multiverse
analysis where data exclusion criteria (i.e., exclusion vs. no exclusion), the
type of test (i.e., Mann-Whitney U vs. t-test), and data transformations (i.e.,
log-transformed vs. raw fuse times) are varied. Typically in multiverse analy-
ses these three decisions would be crossed, resulting in at least eight diﬀerent
analyses. However, in our case some of these analyses are implausible or re-
dundant. First, because the Mann-Whitney U test is unaﬀected by the log
transformation, the log-transformed and raw fuse times yield the same results.
Second, due to the multiple assumption violations, the t-test model for raw
fuse times is severely misspeciﬁed and hence we do not trust the validity of its
result. Third, we do not know which observations were excluded by Frisby &
Clatworthy (1975). Consequently, only two of these eight analyses are relevant.
Furthermore, a more comprehensive multiverse analysis could also consider the
Bayes factors from two-sided tests (i.e., BF10 = 2.323 for the t-test and BF10 =
2.557 for the Mann-Whitney U test). However, these tests are not in line with
the theory under consideration, as they answer a diﬀerent theoretical question
(see “Specifying the statistical model” in the Planning section).
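The invariance of the Mann-Whitney U test to the log transformation follows from the fact that the statistic depends only on the ordering of the observations. The Python sketch below, using made-up fuse times rather than the Frisby and Clatworthy data, shows that the U statistic is identical for raw and log-transformed values.

```python
import math

def mann_whitney_u(x, y):
    """Compute the Mann-Whitney U statistic for sample x versus sample y
    by counting pairwise wins (0.5 for ties)."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

# Hypothetical fuse times (seconds); not the Frisby & Clatworthy data.
group_a = [2.1, 3.5, 8.2, 1.7, 4.4]
group_b = [5.3, 9.1, 2.8, 7.6, 6.0]

u_raw = mann_whitney_u(group_a, group_b)
u_log = mann_whitney_u([math.log(t) for t in group_a],
                       [math.log(t) for t in group_b])

# The log is strictly increasing, so it preserves every pairwise ordering
# and both versions of the statistic agree exactly.
print(u_raw, u_log)
```

The same argument applies to any strictly monotone transformation, which is why the rank-based branch of the multiverse collapses to a single analysis.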
For parameter estimation, the results of the two-sided t-test are presented
in Figure 6b. The 95% central credible interval for δ is relatively wide, ranging
from 0.046 to 0.904: this means that, under the assumption that the effect
exists and given the model we specified, we can be 95% certain that the true
value of δ lies between 0.046 and 0.904. In conclusion, there is moderate
evidence for the
presence of an eﬀect, and large uncertainty about its size.
Stage 4: Reporting the Results
For increased transparency, and to allow a skeptical assessment of the statistical
claims, we recommend presenting a comprehensive analysis report including relevant
tables, ﬁgures, assumption checks, and background information. The extent to
which this needs to be done in the manuscript itself depends on context. Ideally,
an annotated .jasp ﬁle is created that presents the full results and analysis
settings. The resulting ﬁle can then be uploaded to the Open Science Framework
(OSF; https://osf.io), where it can be viewed by collaborators and peers,
even without having JASP installed. Note that the .jasp ﬁle retains the settings
that were used to create the reported output. Analyses not conducted in JASP
should mimic such transparency, for instance through uploading an R-script. In
this section, we list several desiderata for reporting, both for hypothesis testing
and parameter estimation. What to include in the report depends on the goal
of the analysis, regardless of whether the result is conclusive or not.
In all cases, we recommend providing a complete description of the prior
specification (i.e., the type of distribution and its parameter values) and,
especially for informed priors, a justification for the choices that were
made. When reporting a specific analysis, we advise referring to the relevant
background literature for details. In JASP, the relevant references for speciﬁc
tests can be copied from the drop-down menus in the results panel.
When the goal of the analysis is hypothesis testing, it is key to outline which
hypotheses are compared by clearly stating each hypothesis and including the
corresponding subscript in the Bayes factor notation. Furthermore, we
recommend including, if available, the Bayes factor robustness check discussed
in the section on planning (see Figure 7 for an example). This check provides
an assessment of the robustness of the Bayes factor under different prior
specifications:
if the qualitative conclusions do not change across a range of diﬀerent plausible
prior distributions, this indicates that the analysis is relatively robust. If this
plot is unavailable, the robustness of the Bayes factor can be checked manually
by specifying several diﬀerent prior distributions (see the mixed ANOVA anal-
ysis in the online appendix at https://osf.io/wae57/ for an example). When
data come in sequentially, it may also be of interest to examine the sequential
Bayes factor plot, which shows the evidential ﬂow as a function of increasing
sample size.
When the goal of the analysis is parameter estimation, it is important to
present a plot of the posterior distribution, or report a summary, for instance
through the median and a 95% credible interval. Ideally, the results of the
analysis are reported both graphically and numerically. This means that, when
possible, a plot is presented that includes the posterior distribution, prior dis-
tribution, Bayes factor, 95% credible interval, and posterior median.7
Numeric results can be presented either in a table or in the main text. If
relevant, we recommend reporting the results from both estimation and
hypothesis testing. For some analyses, the results are based on a numerical algorithm,
such as Markov chain Monte Carlo (MCMC), which yields an error percentage.
If applicable and available, the error percentage ought to be reported too, to
indicate the numeric robustness of the result. Lower values of the error
percentage indicate greater numerical stability of the result.8 In order to
increase numerical stability, JASP includes an option to increase the number
of samples for MCMC sampling when applicable.

[Footnote 7: The posterior median is popular because it is robust to skewed
distributions and invariant under smooth transformations of parameters,
although other measures of central tendency, such as the mode or the mean,
are also in common use.]

[Footnote 8: We generally recommend error percentages below 20% as
acceptable. A 20% change in the Bayes factor will result in one making the
same qualitative conclusions. However, this threshold naturally increases
with the magnitude of the Bayes factor. For instance, a Bayes factor of 10
with a 50% error percentage could be expected to fluctuate between 5 and 15
upon recomputation. This could be considered a large change. However, with
a Bayes factor of 1000 a 50% reduction would still leave us with
overwhelming evidence.]
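To build intuition for the error percentage, the Python sketch below repeatedly recomputes a simple Monte Carlo estimate and expresses its spread relative to its mean. This is only an illustrative analogue of the error percentage JASP reports, and the estimated quantity here is a stand-in rather than an actual Bayes factor.

```python
import math
import random
import statistics

random.seed(7)

def estimate_quantity(n_samples):
    """A stand-in Monte Carlo estimator: the mean of exp(Z) for standard
    normal Z, estimated from n_samples draws. In a real analysis the
    estimated quantity would be the Bayes factor."""
    draws = [random.gauss(0.0, 1.0) for _ in range(n_samples)]
    return statistics.fmean(math.exp(z) for z in draws)

# Recompute the estimate many times and express its spread relative to
# its mean: a rough analogue of the reported error percentage.
estimates = [estimate_quantity(2_000) for _ in range(50)]
error_percentage = 100 * statistics.stdev(estimates) / statistics.fmean(estimates)
print(f"error percentage: about {error_percentage:.1f}%")

# More samples per run shrink the error percentage (roughly as 1/sqrt(n)).
estimates_big = [estimate_quantity(20_000) for _ in range(50)]
error_percentage_big = 100 * statistics.stdev(estimates_big) / statistics.fmean(estimates_big)
print(f"with 10x the samples: about {error_percentage_big:.1f}%")
```

Increasing the number of samples per run is exactly the remedy the JASP option provides: the relative Monte Carlo error shrinks with the square root of the sample size.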
Stereogram Example
This is an example report for the stereogram t-test:
Here we summarize the results of the Bayesian analysis for the
stereogram data. For this analysis we used the Bayesian t-test frame-
work proposed by Jeﬀreys (1961, see also Rouder et al. 2009). We
analyzed the data with JASP (JASP Team, 2019). An annotated
.jasp ﬁle, including distribution plots, data, and input options,
is available at https://osf.io/25ekj/. Due to model misspeci-
ﬁcation (i.e., non-normality, presence of outliers, and unequal vari-
ances), we applied a log-transformation to the fuse times. This reme-
died the misspeciﬁcation. To assess the robustness of the results, we
also applied a Mann-Whitney U test.
First, we discuss the results for hypothesis testing. The null
hypothesis postulates that there is no diﬀerence in log fuse time
between the groups and therefore H0: δ = 0. The one-sided
alternative hypothesis states that only positive values of δ are
possible, and assigns more prior mass to values closer to 0 than
extreme values. Specifically, δ was assigned a Cauchy prior
distribution with r = 1/√2, truncated to allow only positive effect
size values. Figure
6a shows that the Bayes factor indicates evidence for H+;
specifically, BF+0 = 4.567, which means that the data are
approximately 4.6 times more likely to occur under H+ than under
H0. This result indicates moderate evidence in favor of H+. The
error percentage is <0.001%, which indicates great stability of the
numerical algorithm that was used to obtain the result. The
Mann-Whitney U test yielded a qualitatively similar result,
BF+0 = 5.191. In order to assess the robustness of the Bayes factor
to our prior specification, Figure 7 shows BF+0 as a function of the
prior width r. Across
a wide range of widths, the Bayes factor appears to be relatively
stable, ranging from about 3 to 5.
Second, we discuss the results for parameter estimation. Of in-
terest is the posterior distribution of the standardized eﬀect size δ
(i.e., the population version of Cohen’s d, the standardized diﬀer-
ence in mean fuse times). For parameter estimation, δ was assigned
a Cauchy prior distribution with r = 1/√2. Figure 6b shows that
the median of the resulting posterior distribution for δ equals 0.47
with a central 95% credible interval for δthat ranges from 0.046 to
0.904. If the eﬀect is assumed to exist, there remains substantial
uncertainty about its size, with values close to 0 having the same
posterior density as values close to 1.
Limitations and Challenges
The Bayesian toolkit for the empirical social scientist still has some limitations
to overcome. First, for some frequentist analyses, the Bayesian counterpart has
not yet been developed or implemented in JASP. Second, some analyses in
JASP currently provide only a Bayes factor, and not a visual representation of
the posterior distributions, for instance due to the multidimensional parameter
space of the model. Third, some analyses in JASP are only available with a
relatively limited set of prior distributions. However, these are not principled
limitations, and the software is actively being developed to overcome them.
When dealing with more complex models that go beyond the staple
[Figure 6 shows two JASP prior-and-posterior plots of density against effect
size δ (from −2 to 2). Panel (a), the one-sided analysis for testing
H+: δ > 0, reports BF+0 = 4.567, BF0+ = 0.219, median = 0.469, 95% CI:
[0.083, 0.909]. Panel (b), the two-sided analysis for estimation with
H1: δ ∼ Cauchy, reports BF10 = 2.323, BF01 = 0.431, median = 0.468, 95% CI:
[0.046, 0.904].]

Figure 6: Bayesian two-sample t-test for the parameter δ. The probability wheel
on top visualizes the evidence that the data provide for the two rival hypotheses.
The two gray dots indicate the prior and posterior density at the test value
(Dickey & Lientz, 1970; Wagenmakers et al., 2010). The median and the 95%
central credible interval of the posterior distribution are shown in the top right
corner. The left panel shows the one-sided procedure for hypothesis testing and
the right panel shows the two-sided procedure for parameter estimation. Both
figures from JASP.
analyses such as t-tests, there exist a number of software packages that allow
custom coding, such as JAGS (Plummer, 2003) or Stan (Carpenter et al., 2017).
Another option for Bayesian inference is to code the analyses in a programming
language such as R (R Core Team, 2018) or Python (van Rossum, 1995). This
requires a certain degree of programming ability, but grants the user more
flexibility. Popular packages for conducting Bayesian analyses in R are the
BayesFactor package (Morey & Rouder, 2015) and the brms package (Bürkner, 2017),
among others (see https://cran.r-project.org/web/views/Bayesian.html
for a more exhaustive list). For Python, a popular package for Bayesian analy-
ses is PyMC3 (Salvatier et al., 2016). The practical guidelines provided in this
paper can largely be generalized to the application of these software programs.
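For nested models such as the t-test, one analysis that is simple to code oneself is the Savage–Dickey density ratio visualized by the two gray dots in Figure 6: the Bayes factor BF01 equals the posterior density divided by the prior density at the test value (Dickey & Lientz, 1970; Wagenmakers et al., 2010). The Python sketch below applies this idea on a grid, pairing the default-style Cauchy prior with a hypothetical normal likelihood for an observed effect size; the numbers are illustrative, not the stereogram data.

```python
import math

def cauchy_pdf(x, scale):
    return 1.0 / (math.pi * scale * (1.0 + (x / scale) ** 2))

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

# Hypothetical data summary: an observed effect size of 0.5 with standard
# error 0.2 (not the stereogram numbers). Prior: Cauchy with r = 1/sqrt(2).
obs, se, r = 0.5, 0.2, 1.0 / math.sqrt(2.0)

# Posterior density on a grid: prior(delta) * likelihood(obs | delta),
# normalized numerically by a Riemann sum.
step = 0.001
grid = [i * step for i in range(-8000, 8001)]
unnorm = [cauchy_pdf(d, r) * normal_pdf(obs, d, se) for d in grid]
norm_const = sum(unnorm) * step
posterior_at_zero = (cauchy_pdf(0.0, r) * normal_pdf(obs, 0.0, se)) / norm_const

# Savage-Dickey: BF01 = posterior density at 0 / prior density at 0.
bf01 = posterior_at_zero / cauchy_pdf(0.0, r)
bf10 = 1.0 / bf01
print(f"BF10 = {bf10:.2f}")
```

The grid approach generalizes to any nested point hypothesis, and the two densities it compares are precisely what the gray dots in a JASP prior-and-posterior plot display.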
[Figure 7 shows BF+0 on a log scale (from 1/3 to 30, with regions labeled
Anecdotal, Moderate, and Strong evidence for H+ or H0) as a function of the
Cauchy prior width r, ranging from 0 to 1.5. The legend reports max
BF+0 = 5.142 at r = 0.3801; user prior (r = 1/√2): BF+0 = 4.567; wide prior
(r = 1): BF+0 = 3.855; ultrawide prior (r = √2): BF+0 = 3.054.]

Figure 7: The Bayes factor robustness plot. The maximum BF+0 is attained
when setting the prior width r to 0.38. The plot indicates BF+0 for the user
specified prior (r = 1/√2), wide prior (r = 1), and ultrawide prior (r = √2).
The evidence for the alternative hypothesis is relatively stable across a wide
range of prior distributions, suggesting that the analysis is robust. However,
the evidence in favor of H+ is not particularly strong and will not convince
a skeptic.
We have attempted to provide concise recommendations for planning, execut-
ing, interpreting, and reporting Bayesian analyses. These recommendations are
summarized in Table 1. Our guidelines focused on the standard analyses that
are currently featured in JASP. When going beyond these analyses, some of
the discussed guidelines will be easier to implement than others. However, the
general process of transparent, comprehensive, and careful statistical reporting
extends to all Bayesian procedures and indeed to statistical analyses across the
board.
Author Contributions
JvD wrote the main manuscript. EJW, AE, JH, and JvD contributed to
manuscript revisions. All authors reviewed the manuscript and provided feed-
back.
Open Practices Statement
The data and materials are available at https://osf.io/nw49j/.
Planning
  Write the methods section in advance of data collection
  Distinguish between exploratory and confirmatory research
  Specify the goal: estimation, testing, or both
  If the goal is testing, decide on a one-sided or two-sided procedure
  Choose a statistical model
  Determine which model checks will need to be performed
  Specify which steps can be taken to deal with possible model violations
  Choose a prior distribution
  Consider how to assess the impact of prior choices on the inferences
  Specify the sampling plan
  Consider a Bayes factor design analysis
  Preregister the analysis plan for increased transparency
  Consider specifying a multiverse analysis

Executing
  Check the quality of the data (e.g., assumption checks)
  Annotate the JASP output

Interpreting
  Beware of the common pitfalls
  Use the correct interpretation of Bayes factor and credible interval
  When in doubt, ask for advice (e.g., on the JASP forum)

Reporting
  Mention the goal of the analysis
  Include a plot of the prior and posterior distribution, if available
  If testing, report the Bayes factor, including its subscripts
  If estimating, report the posterior median and x% credible interval
  Include which prior settings were used
  Justify the prior settings (particularly for informed priors in a testing scenario)
  Discuss the robustness of the result
  If relevant, report the results from both estimation and hypothesis testing
  Refer to the statistical literature for details about the analyses used
  Consider a sequential analysis
  Report the results of any multiverse analyses, if conducted
  Make the .jasp file and data available online

Table 1: A summary of the guidelines for the different stages of a Bayesian
analysis, with a focus on analyses conducted in JASP. Note that the stages have
a predetermined order, but the individual recommendations can be rearranged
where necessary.
References
Andrews, M., & Baguley, T. (2013). Prior approval: The growth of Bayesian
methods in psychology. British Journal of Mathematical and Statistical Psy-
chology,66 , 1–7.
Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statisti-
cian,27 , 17–21.
Appelbaum, M., Cooper, H., Kline, R. B., Mayo-Wilson, E., Nezu, A. M.,
& Rao, S. M. (2018). Journal article reporting standards for quantitative
research in psychology: The APA publications and communications board
task force report. American Psychologist,73 , 3–25.
Berger, J. O. (2006). Bayes factors. In S. Kotz, N. Balakrishnan, C. Read,
B. Vidakovic, & N. L. Johnson (Eds.), Encyclopedia of statistical sciences,
vol. 1 (2nd ed.) (pp. 378–386). Hoboken, NJ: Wiley.
Berger, J. O., & Wolpert, R. L. (1988). The likelihood principle (2nd ed.).
Hayward (CA): Institute of Mathematical Statistics.
Bürkner, P.-C. (2017). brms: An R package for Bayesian multilevel models
using Stan. Journal of Statistical Software, 80, 1–28.
Carpenter, B., Gelman, A., Hoﬀman, M., Lee, D., Goodrich, B., Betancourt,
M., . . . Riddell, A. (2017). Stan: A probabilistic programming language.
Journal of Statistical Software,76 , 1–37.
Clyde, M. A., Ghosh, J., & Littman, M. L. (2011). Bayesian adaptive sampling
for variable selection and model averaging. Journal of Computational and
Graphical Statistics,20 , 80–101.
Depaoli, S., & van de Schoot, R. (2017). Improving transparency and replication
in Bayesian statistics: The WAMBS-checklist. Psychological Methods,22 ,
240–261.
Dickey, J. M., & Lientz, B. P. (1970). The weighted likelihood ratio, sharp
hypotheses about chances, the order of a Markov chain. The Annals of Math-
ematical Statistics,41 , 214–226.
Dienes, Z. (2014). Using Bayes to get the most out of non-significant results.
Frontiers in Psychology, 5:781.
Draper, N. R., & Cox, D. R. (1969). On distributions and their transformation to
normality. Journal of the Royal Statistical Society: Series B (Methodological),
31 , 472–476.
Etz, A. (2018). Introduction to the concept of likelihood and its applications.
Advances in Methods and Practices in Psychological Science,1, 60–69.
Etz, A., Haaf, J. M., Rouder, J. N., & Vandekerckhove, J. (2018). Bayesian
inference and testing any hypothesis you can specify. Advances in Methods
and Practices in Psychological Science,1(2), 281–295.
Etz, A., & Wagenmakers, E.-J. (2017). J. B. S. Haldane’s contribution to the
Bayes factor hypothesis test. Statistical Science ,32 , 313–329.
Fisher, R. (1925). Statistical methods for research workers (12th ed.).
Edinburgh: Oliver & Boyd.
Frisby, J. P., & Clatworthy, J. L. (1975). Learning to see complex random-dot
stereograms. Perception,4, 173–178.
Gronau, Q. F., Ly, A., & Wagenmakers, E.-J. (2020). Informed Bayesian t-tests.
The American Statistician,74 , 137–143.
Haaf, J., Ly, A., & Wagenmakers, E. (2019). Retire signiﬁcance, but still test
hypotheses. Nature,567 (7749), 461.
Hinne, M., Gronau, Q. F., van den Bergh, D., & Wagenmakers, E.-J. (2020). A
conceptual introduction to Bayesian model averaging. Advances in Methods
and Practices in Psychological Science,3, 200–215.
Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). Bayesian
model averaging: A tutorial. Statistical Science, 14, 382–401.
Jarosz, A. F., & Wiley, J. (2014). What are the odds? A practical guide to
computing and reporting Bayes factors. Journal of Problem Solving,7, 2-9.
JASP Team. (2019). JASP (Version 0.9.2) [Computer software]. Retrieved from
https://jasp-stats.org/
Jeﬀreys, H. (1939). Theory of probability (1st ed.). Oxford University Press.
Jeﬀreys, H. (1961). Theory of probability (3rd ed.). Oxford University Press.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American
Statistical Association,90 , 773–795.
Keysers, C., Gazzola, V., & Wagenmakers, E.-J. (2020). Using Bayes factor
hypothesis testing in neuroscience to establish evidence of absence. Nature
Neuroscience,23 , 788–799.
Lee, M. D., & Vanpaemel, W. (2018). Determining informative priors for
cognitive models. Psychonomic Bulletin & Review,25 , 114–127.
Liang, F., Paulo, R., Molina, G., Clyde, M. A., & Berger, J. O. (2008). Mixtures
of g priors for Bayesian variable selection. Journal of the American Statistical
Association, 103, 410–424.
Lo, S., & Andrews, S. (2015). To transform or not to transform: Using gener-
alized linear mixed models to analyse reaction time data. Frontiers in psy-
chology,6, 1171.
Ly, A., Verhagen, A. J., & Wagenmakers, E.-J. (2016). Harold Jeﬀreys’s default
Bayes factor hypothesis tests: Explanation, extension, and application in
psychology. Journal of Mathematical Psychology,72 , 19–32.
Marsman, M., & Wagenmakers, E.-J. (2017). Bayesian beneﬁts with JASP.
European Journal of Developmental Psychology ,14 , 545–555.
Matzke, D., Nieuwenhuis, S., van Rijn, H., Slagter, H. A., van der Molen, M. W.,
& Wagenmakers, E.-J. (2015). The eﬀect of horizontal eye movements on free
recall: A preregistered adversarial collaboration. Journal of Experimental
Psychology: General,144 , e1–e15.
Morey, R. D., Hoekstra, R., Rouder, J. N., Lee, M. D., & Wagenmakers, E.-J.
(2016). The fallacy of placing conﬁdence in conﬁdence intervals. Psychonomic
Bulletin & Review,23 , 103–123.
Morey, R. D., & Rouder, J. N. (2015). BayesFactor 0.9.11-
1. Comprehensive R Archive Network. Retrieved from
http://cran.r-project.org/web/packages/BayesFactor/index.html
Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical
models using Gibbs sampling. In K. Hornik, F. Leisch, & A. Zeileis (Eds.),
Proceedings of the 3rd international workshop on distributed statistical com-
puting. Vienna, Austria.
R Core Team. (2018). R: A language and environment for statistical com-
puting [Computer software manual]. Vienna, Austria. Retrieved from
https://www.R-project.org/
Rouder, J. N. (2014). Optional stopping: No problem for Bayesians. Psycho-
nomic Bulletin & Review,21 , 301–308.
Rouder, J. N., Haaf, J. M., & Vandekerckhove, J. (2018). Bayesian inference for
psychology, Part IV: Parameter estimation and Bayes factors. Psychonomic
Bulletin & Review, 25(1), 102–113.
Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009).
Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic
Bulletin & Review, 16, 225–237.
Salvatier, J., Wiecki, T. V., & Fonnesbeck, C. (2016). Probabilistic program-
ming in Python using PyMC3. PeerJ Computer Science,2, e55.
Schönbrodt, F. D., & Wagenmakers, E.-J. (2018). Bayes factor design analysis:
Planning for compelling evidence. Psychonomic Bulletin & Review, 25, 128–142.
Schramm, P., & Rouder, J. N. (2019). Are reaction time transformations really
beneficial? PsyArXiv preprint.
Spiegelhalter, D. J., Myles, J. P., Jones, D. R., & Abrams, K. R. (2000).
Bayesian methods in health technology assessment: A review. Health Tech-
nology Assessment,4, 1–130.
Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing
transparency through a multiverse analysis. Perspectives on Psychological
Science,11 , 702–712.
Stefan, A. M., Gronau, Q. F., Schönbrodt, F. D., & Wagenmakers, E.-J. (2019).
A tutorial on Bayes factor design analysis using an informed prior. Behavior
Research Methods, 51, 1042–1058.
Sung, L., Hayden, J., Greenberg, M. L., Koren, G., Feldman, B. M., & Tomlin-
son, G. A. (2005). Seven items were identiﬁed for inclusion when reporting
a Bayesian analysis of a clinical study. Journal of Clinical Epidemiology,58 ,
261–268.
The BaSiS group. (2001). Bayesian standards in science: Standards for report-
ing of Bayesian analyses in the scientiﬁc literature. Internet. Retrieved from
http://lib.stat.cmu.edu/bayesworkshop/2001/BaSis.html
Tijmstra, J. (2018). Why checking model assumptions using null hypothesis
significance tests does not suffice: A plea for plausibility. Psychonomic
Bulletin & Review, 25, 548–559.
Vandekerckhove, J., Rouder, J. N., & Kruschke, J. K. (Eds.). (2018). Beyond the
new statistics: Bayesian inference for psychology [special issue]. Psychonomic
Bulletin & Review,25 .
van den Bergh, D., Haaf, J. M., Ly, A., Rouder, J. N., & Wagenmakers, E.-J.
(2019). A cautionary note on estimating eﬀect size. PsyArXiv. Retrieved
from psyarxiv.com/h6pr8
van Doorn, J., Ly, A., Marsman, M., & Wagenmakers, E.-J. (2019). Bayesian
rank-based hypothesis testing for the rank sum test, the signed rank test, and
Spearman’s rho. arXiv preprint arXiv:1712.06941 .
van Rossum, G. (1995). Python tutorial (Tech. Rep. No. CS-R9526). Amster-
dam: Centrum voor Wiskunde en Informatica (CWI).
Wagenmakers, E.-J., Beek, T., Rotteveel, M., Gierholz, A., Matzke, D., Ste-
ingroever, H., . . . Pinto, Y. (2015). Turning the hands of time again: A
purely conﬁrmatory replication study and a Bayesian analysis. Frontiers in
Psychology: Cognition,6:494 .
Wagenmakers, E.-J., Lodewyckx, T., Kuriyal, H., & Grasman, R. (2010).
Bayesian hypothesis testing for psychologists: A tutorial on the Savage–
Dickey method. Cognitive Psychology,60 , 158–189.
Wagenmakers, E.-J., Love, J., Marsman, M., Jamil, T., Ly, A., Verhagen, J.,
. . . Morey, R. D. (2018). Bayesian inference for psychology. Part II: Example
applications with JASP. Psychonomic Bulletin & Review,25 , 58–76.
Wagenmakers, E.-J., Marsman, M., Jamil, T., Ly, A., Verhagen, J., Love, J., .. .
Morey, R. D. (2018). Bayesian inference for psychology. Part I: Theoretical
advantages and practical ramiﬁcations. Psychonomic Bulletin & Review,25 ,
35–57.
Wagenmakers, E.-J., Morey, R. D., & Lee, M. D. (2016). Bayesian beneﬁts for
the pragmatic researcher. Current Directions in Psychological Science,25 ,
169–176.
Wetzels, R., Raaijmakers, J. G. W., Jakab, E., & Wagenmakers, E.-J. (2009).
How to quantify support for and against the null hypothesis: A flexible
WinBUGS implementation of a default Bayesian t test. Psychonomic Bulletin &
Review, 16, 752–760.
Wicherts, J. M., Veldkamp, C. L. S., Augusteijn, H. E. M., Bakker, M., van
Aert, R. C. M., & van Assen, M. A. L. M. (2016). Degrees of freedom in
planning, running, analyzing, and reporting psychological studies: A checklist
to avoid p-hacking. Frontiers in Psychology,7, 1832.
Wrinch, D., & Jeﬀreys, H. (1921). On certain fundamental principles of scientiﬁc
inquiry. Philosophical Magazine,42 , 369–390.
Article
Full-text available
Article
Full-text available
Social media use research remains dominated by self-report measures, despite concerns they may not accurately reflect objective social media use. The association between commonly employed self-report measures and objective social media use remains unclear. The aim of this study was to determine the degree of association between an objective and commonly employed subjective measures of social media use. The study specifically examined a single-estimate self-report measure, a problematic social media use scale, and objective use derived from smartphone data, in a sample of 209 individuals. The findings showed a very weak non-significant relationship between the objective measure and the single-estimate measure, (r = −.04, p = .58, BF10 = 0.18), and a weak significant relationship between the objective measure and the problematic social media use scale (r = .19, p = .01, BF10 = 3.04). These findings converge with other recent research to suggest there is very little shared variance between subjective estimates of social media use and objective use. This highlights the possibility that subjective social media use may be largely unrelated to objective use, which has implications for ensuring the rigor of future research and raising potential concerns regarding the veracity of previous research.
Article
Full-text available
Accumulating evidence shows that the posterior cerebellum is involved in mentalizing inferences of social events by detecting sequence information in these events, and building and updating internal models of these sequences. By applying anodal and sham cerebellar transcranial direct current stimulation (tDCS) on the posteromedial cerebellum of healthy participants, and using a serial reaction time (SRT) task paradigm, the current study examined the causal involvement of the cerebellum in implicitly learning sequences of social beliefs of others (Belief SRT) and non-social colored shapes (Cognitive SRT). Apart from the social or cognitive domain differences, both tasks were structurally identical. Results of anodal stimulation (i.e., 2 mA for 20 min) during the social Belief SRT task, did not show significant improvement in reaction times, however it did reveal generally faster responses for the Cognitive SRT task. This improved performance could also be observed after the cessation of stimulation after 30 min, and up to one week later. Our findings suggest a general positive effect of anodal cerebellar tDCS on implicit non-social Cognitive sequence learning, supporting a causal role of the cerebellum in this learning process. We speculate that the lack of tDCS modulation of the social Belief SRT task is due to the familiar and overlearned nature of attributing social beliefs, suggesting that easy and automatized tasks leave little room for improvement through tDCS.
Article
Article
Full-text available
Objective A recent hypothesis suggests that functional somatic symptoms are due to altered information processing in the brain, with rigid expectations biasing sensorimotor signal processing. First experimental results confirmed such altered processing within the affected symptom modality, e.g., deficient eye-head coordination in patients with functional dizziness. Studies in patients with functional somatic symptoms looking at general, trans-symptomatic processing deficits are sparse. Here, we investigate sensorimotor processing during eye-head gaze shifts in irritable bowel syndrome (IBS) to test whether processing deficits exist across symptom modalities. Methods Study participants were seven patients suffering from IBS and seven age- and gender-matched healthy controls who performed large gaze shifts toward visual targets. Participants performed combined eye-head gaze shifts in the natural condition and with experimentally increased head moment of inertia. Head oscillations as a marker for sensorimotor processing deficits were assessed. Bayes statistics was used to assess evidence for the presence or absence of processing differences between IBS patients and healthy controls. Results With the head moment of inertia increased, IBS patients displayed more pronounced head oscillations than healthy controls (Bayes Factor 10 = 56.4, corresponding to strong evidence). Conclusion Patients with IBS show sensorimotor processing deficits, reflected by increased head oscillations during large gaze shifts to visual targets. In particular, patients with IBS have difficulties to adapt to the context of altered head moment of inertia. Our results suggest general transdiagnostic processing deficits in functional somatic disorders.
Article
Full-text available
It is still debated whether metacognition, or the ability to monitor our own mental states, relies on processes that are “domain-general” (a single set of processes can account for the monitoring of any mental process) or “domain-specific” (metacognition is accomplished by a collection of multiple monitoring modules, one for each cognitive domain). It has been speculated that two broad categories of metacognitive processes may exist: those that monitor primarily externally generated versus those that monitor primarily internally generated information. To test this proposed division, we measured metacognitive performance (using m-ratio, a signal detection theoretical measure) in four tasks that could be ranked along an internal-external axis of the source of information, namely memory, motor, visuomotor, and visual tasks. We found correlations between m-ratios in visuomotor and motor tasks, but no correlations between m-ratios in visual and visuomotor tasks, or between motor and memory tasks. While we found no correlation in metacognitive ability between visual and memory tasks, and a positive correlation between visuomotor and motor tasks, we found no evidence for a correlation between motor and memory tasks. This pattern of correlations does not support the grouping of domains based on whether the source of information is primarily internal or external. We suggest that other groupings could be more reflective of the nature of metacognition and discuss the need to consider other non-domain task-features when using correlations as a way to test the underlying shared processes between domains.
Article
Full-text available
Despite various studies examining intertemporal choice with hypothetical rewards due to problematic real reward delivery, there remains no substantial evidence on the effect of the incentives on the decision confidence and cognitive process in intertemporal choice and no comprehensive exploration on the loss domain. Hence, this study conducts an eye-tracking experiment to examine the effect of incentive approach and measure participants' decision confidence using a between-subject design in both gain and loss domains. Results replicated previous findings which show incentives do not affect intertemporal choice in the gain domain. In contrast, in the loss domain, participants in the incentivized group were more likely to choose the larger-later options than those in the non-incentivized group. Furthermore, the decision confidence and the mean fixation duration differed between the incentivized and non-incentivized groups in both gain and loss domains. These findings allow for a better understanding of the effect of incentives on intertemporal choice and provide valuable information for the design of incentives in future intertemporal experiments.
Article
Full-text available
An increasingly popular approach to statistical inference is to focus on the estimation of effect size. Yet this approach is implicitly based on the assumption that there is an effect while ignoring the null hypothesis that the effect is absent. We demonstrate how this common null-hypothesis neglect may result in effect size estimates that are overly optimistic. As an alternative to the current approach, a spike-and-slab model explicitly incorporates the plausibility of the null hypothesis into the estimation process. We illustrate the implications of this approach and provide an empirical example.
Article
Full-text available
Many statistical scenarios initially involve several candidate models that describe the data-generating process. Analysis often proceeds by first selecting the best model according to some criterion and then learning about the parameters of this selected model. Crucially, however, in this approach the parameter estimates are conditioned on the selected model, and any uncertainty about the model-selection process is ignored. An alternative is to learn the parameters for all candidate models and then combine the estimates according to the posterior probabilities of the associated models. This approach is known as Bayesian model averaging (BMA). BMA has several important advantages over all-or-none selection methods, but has been used only sparingly in the social sciences. In this conceptual introduction, we explain the principles of BMA, describe its advantages over all-or-none model selection, and showcase its utility in three examples: analysis of covariance, meta-analysis, and network analysis.
Article
Full-text available
This article explores whether the null hypothesis significance testing (NHST) framework provides a sufficient basis for the evaluation of statistical model assumptions. It is argued that while NHST-based tests can provide some degree of confirmation for the model assumption that is evaluated—formulated as the null hypothesis—these tests do not inform us of the degree of support that the data provide for the null hypothesis and to what extent the null hypothesis should be considered to be plausible after having taken the data into account. Addressing the prior plausibility of the model assumption is unavoidable if the goal is to determine how plausible it is that the model assumption holds. Without assessing the prior plausibility of the model assumptions, it remains fully uncertain whether the model of interest gives an adequate description of the data and thus whether it can be considered valid for the application at hand. Although addressing the prior plausibility is difficult, ignoring the prior plausibility is not an option if we want to claim that the inferences of our statistical model can be relied upon.
Article
Full-text available
In the psychological literature, there are two seemingly different approaches to inference: that from estimation of posterior intervals and that from Bayes factors. We provide an overview of each method and show that a salient difference is the choice of models. The two approaches as commonly practiced can be unified with a certain model specification, now popular in the statistics literature, called spike-and-slab priors. A spike-and-slab prior is a mixture of a null model, the spike, with an effect model, the slab. The estimate of the effect size here is a function of the Bayes factor, showing that estimation and model comparison can be unified. The salient difference is that common Bayes factor approaches provide for privileged consideration of theoretically useful parameter values, such as the value corresponding to the null hypothesis, while estimation approaches do not. Both approaches, either privileging the null or not, are useful depending on the goals of the analyst.
Article
Full-text available
Following a review of extant reporting standards for scientific publication, and reviewing 10 years of experience since publication of the first set of reporting standards by the American Psychological Association (APA; APA Publications and Communications Board Working Group on Journal Article Reporting Standards, 2008), the APA Working Group on Quantitative Research Reporting Standards recommended some modifications to the original standards. Examples of modifications include division of hypotheses, analyses, and conclusions into 3 groupings (primary, secondary, and exploratory) and some changes to the section on meta-analysis. Several new modules are included that report standards for observational studies, clinical trials, longitudinal studies, replication studies, and N-of-1 studies. In addition, standards for analytic methods with unique characteristics and output (structural equation modeling and Bayesian analysis) are included. These proposals were accepted by the Publications and Communications Board of APA and supersede the standards included in the 6th edition of the Publication Manual of the American Psychological Association (APA, 2010).
Article
Full-text available
Bayesian inference for rank-order problems is frustrated by the absence of an explicit likelihood function. This hurdle can be overcome by assuming a latent normal representation that is consistent with the ordinal information in the data: the observed ranks are conceptualized as an impoverished reflection of an underlying continuous scale, and inference concerns the parameters that govern the latent representation. We apply this generic data-augmentation method to obtain Bayesian counterparts of three popular rank-based tests: the rank sum test, the signed rank test, and Spearman's $\rho$.
Article
Most neuroscientists would agree that for brain research to progress, we have to know which experimental manipulations have no effect as much as we must identify those that do have an effect. The dominant statistical approaches used in neuroscience rely on P values and can establish the latter but not the former. This makes non-significant findings difficult to interpret: do they support the null hypothesis or are they simply not informative? Here we show how Bayesian hypothesis testing can be used in neuroscience studies to establish both whether there is evidence of absence and whether there is absence of evidence. Through simple tutorial-style examples of Bayesian t-tests and ANOVA using the open-source project JASP, this article aims to empower neuroscientists to use this approach to provide compelling and rigorous evidence for the absence of an effect. Keysers et al. show why P values do not differentiate inconclusive null findings from those that provide important evidence for the absence of an effect. They provide a tutorial on how to use Bayesian hypothesis testing to overcome this issue.
Article
Further consideration is given to a family of power transformations discussed by Box and Cox (1964). It is shown that this family of transformations can be useful even in situations where no power transformation can produce normality exactly. The precision of estimation of the transformation parameter is also discussed.
Article
Hypothesis testing is a special form of model selection. Once a pair of competing models is fully defined, their definition immediately leads to a measure of how strongly each model supports the data. The ratio of their support is often called the likelihood ratio or the Bayes factor. Critical in the model-selection endeavor is the specification of the models. In the case of hypothesis testing, it is of the greatest importance that the researcher specify exactly what is meant by a “null” hypothesis as well as the alternative to which it is contrasted, and that these are suitable instantiations of theoretical positions. Here, we provide an overview of different instantiations of null and alternative hypotheses that can be useful in practice, but in all cases the inferential procedure is based on the same underlying method of likelihood comparison. An associated app can be found at https://osf.io/mvp53/. This article is the work of the authors and is reformatted from the original, which was published under a CC-By Attribution 4.0 International license and is available at https://psyarxiv.com/wmf3r/.