Evaluative Conditioning without awareness:
Replicable effects do not equate replicable inferences
Ian Hussey & Sean Hughes
Moran et al.’s (2020) primary analysis successfully replicated the surveillance task effect
obtained by Olson and Fazio (2001). This effect is often treated as evidence for attitude
formation in the absence of awareness. However, such an inference requires that ‘aware’
participants are successfully excluded from consideration. We present evidence that the
awareness exclusion criterion used by Olson and Fazio (2001) – the only one to produce
a significant effect in the replication – is a poor measure of awareness: it is overly lax,
noisy, and demonstrates heterogeneity between sites. A new meta-analysis using a
stricter compound awareness criterion that prioritized sensitivity (N = 665)
demonstrated a non-significant and near-zero effect size (Hedges’ g = 0.00, p = .983,
BF10 = 0.04). When subjected to a more severe test, Moran et al.’s (2020) data does
not support the ‘unaware Evaluative Conditioning’ hypothesis. Results serve to
highlight the importance of distinguishing between a replicable statistical effect and a
replicable inference regarding a verbal hypothesis. All data and code available at
Olson and Fazio (2001) presented evidence that
changes in liking due to the pairing of stimuli (i.e.,
Evaluative Conditioning effects: ‘EC’) can take place
even when people are ‘unaware’ that stimuli have been
paired. Recently, Moran et al. (2020) conducted a close
replication of this work.
While Moran et al.’s (2020)
results replicated the original effect reported in Olson
and Fazio (2001), we argue that both Olson and Fazio
(2001) and Moran et al. (2020) represent weak tests of
the underlying verbal hypothesis of ‘unaware EC’.
Let us be clear: we are not arguing the EC effect
produced by Olson and Fazio’s (2001) surveillance task
does not replicate. The results of Moran et al. (2020)
indicate that it does. Rather, we are arguing that this
experimental setup is a poor test of the verbal
hypothesis that is ultimately of interest. In our opinion,
the surveillance task and awareness measures produced
replicable statistical effects, but unreplicable inferences regarding the verbal hypothesis of ‘unaware Evaluative Conditioning’ (a distinction originally made by Vazire, 2019; see also Hussey & Hughes, 2020; Yarkoni, 2019).

We are third and second authors (respectively) of Moran et al. (2020). Given the large number of authors involved in Moran et al. (2020), there was a diverse set of opinions on the concept of ‘awareness’ and how results in that article should be interpreted. Moran et al. (2020) represents the consensus opinion among that study’s authors, whereas this commentary provides our own opinions.

As Moran et al. (2020) note, there is debate as to whether the exclusion criteria capture ‘awareness’ of the stimulus pairings, ‘recollective memory’ of this awareness, or both (see Gawronski & Walther, 2012; Jones et al., 2009). We refer to the criteria as measures of awareness throughout the current article. Rather than focus on what is being measured, we focus on the more fundamental question of whether they are reliable measures in the first place.
To briefly recap, Moran et al. (2020) examined if
EC effects on the surveillance task were present when
four different awareness
exclusion criteria were applied
(i.e., the ‘Olson & Fazio, 2001’, ‘Olson & Fazio, 2001
modified’, ‘Bar-Anan, De Houwer, & Nosek, 2010’, and
‘Bar-Anan et al., 2010 modified’ criteria; for details of
each see Moran et al., 2020). Their primary analysis
was based on the original authors’ exclusion criterion
(i.e., ‘Olson & Fazio, 2001’) which, when applied, led
to a significant effect (Hedges’ g = 0.12, 95% CI [0.05,
0.20], p = .002). Applying any of the other three
(preregistered) secondary exclusion criteria did not lead
to significant EC effects (all gs = 0.03 to 0.05, all ps non-significant).
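The effect sizes throughout are Hedges’ g, i.e., Cohen’s d with a small-sample bias correction. As a minimal sketch of that conversion (the group summary statistics below are hypothetical, chosen only to illustrate the magnitude of the correction):

```python
import math

def hedges_g(m1, m2, sd1, sd2, n1, n2):
    """Cohen's d between two groups, with Hedges' small-sample correction."""
    # Pooled standard deviation
    sp = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))
    d = (m1 - m2) / sp
    # Approximate correction factor J = 1 - 3 / (4 * df - 1); J -> 1 as df grows,
    # so g and d are nearly identical at the sample sizes discussed here
    df = n1 + n2 - 2
    return d * (1 - 3 / (4 * df - 1))

# Hypothetical group summaries: d = 0.10, barely shrunk by the correction
print(round(hedges_g(0.60, 0.50, 1.0, 1.0, 350, 350), 3))
```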
Of course, testing the ‘unaware EC’ hypothesis
requires a reliable and valid measure capable of
excluding participants who were ‘aware’ of the stimulus
pairings. What Olson and Fazio (2001; and, by
extension, Moran et al., 2020) failed to do, in our
opinion, was to consider the structural validity of the
awareness exclusion criteria. While Moran et al. (2020)
noted that “any attempt to detect differences in EC
effects between putatively ‘aware’ and ‘unaware’
participants will ultimately depend on the reliability of
the awareness measure” (p. 23), and that such measures
are frequently unreliable (Shanks, 2017; Vadillo et al.,
2020), that article did not contain any direct
consideration of the structural validity of the awareness
measures. Recent work has argued that such issues
around measurement are common yet underappreciated
in psychology and serve to threaten the validity of our
findings and the conclusions we draw from them (Flake
et al., 2017; Flake & Fried, 2019; Hussey & Hughes, 2020).
In our opinion, the effect obtained in Moran et al.’s
(2020) primary analysis was driven by the fact that the
exclusion criterion used in that analysis failed to
exclude individuals who were aware, with the observed
effect driven by these ‘aware’ participants. In this paper
we (1) assess the validity of the four awareness criteria
and conclude that they are poor and noisy measures of
awareness, and (2) conduct a stricter test of the core
verbal hypothesis and conclude that the evidence does
not support ‘unaware EC’.
Poor measures of awareness
Reliability between criteria
As we previously mentioned, the ‘Olson and Fazio
(2001)’ criterion used in the primary analysis was the
only criterion under which a significant EC effect was
found. Importantly, it was also the most liberal one by
far: it scored only 8% of participants as ‘aware’,
whereas other exclusion criteria scored up to 48% of
participants as ‘aware’ (‘Olson & Fazio, 2001 modified’
criterion = 31%; ‘Bar-Anan et al., 2010’ criterion =
48%; ‘Bar-Anan et al., 2010 modified’ criterion = 27%).
While these awareness rates were reported in Moran et
al. (2020), that article did not directly consider the
relationship between the criteria’s relative strictness
and the EC effects they produced.
What the above shows is that there were
meaningful differences in the exclusion rates observed
between criteria. If these measures demonstrated very
good measurement properties, this pattern of results
would be due to the measures differing only in their
relative strictness, in an everyday sense, rather than
there also being unreliability between them. In this
context, this question of differing ‘strictness’ (vs. mere
unreliability) is a quantifiable statistical property
referred to as the degree of conformity to a Guttman
structure, which is testable using methods from Item
Response Theory modelling. Specifically, if these
measures demonstrated perfect reliability and differed
only in strictness, we would expect the proportion of
Guttman errors (G) to be very small (i.e., to approach 0).
In contrast, if they were unreliable, we would expect G
to approach 1.

It is worth noting that the first author was responsible for the creation and distribution of the measures used in Moran et al. (2020), and as such is highly familiar with them and the efforts to standardize them between sites.
Results demonstrated that measures were indeed
quite unreliable. Nearly half of participants had scores
on one or more awareness criteria that indicated such
errors, G = 47.5%, 95% CI [45.5, 49.5], G* = 11.9%,
95% CI [11.4, 12.4] (see Meijer, 1994; and see the
Supplementary Materials for full details, data, code,
and results of all analyses). In other words,
in about half of participants, a supposedly more lenient
awareness criterion actually scored them more strictly
than a supposedly stricter criterion.
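The Guttman-error logic above can be sketched as follows. This is an illustrative reimplementation over hypothetical binary awareness classifications, not the authors’ actual analysis code (which is in the Supplementary Materials):

```python
import numpy as np

def guttman_error_stats(X):
    """Proportion of participants showing at least one Guttman error.

    X: (participants x criteria) binary matrix, 1 = scored 'aware'.
    A Guttman error occurs when a participant is flagged by a criterion
    with a LOWER overall endorsement rate while not being flagged by one
    with a HIGHER rate, violating the expected cumulative ordering.
    """
    X = np.asarray(X)
    # Sort criteria from most- to least-frequently endorsed
    X = X[:, np.argsort(-X.mean(axis=0), kind="stable")]

    def errors(row):
        # Count (0 then 1) inversions reading left to right
        e, zeros = 0, 0
        for v in row:
            if v == 0:
                zeros += 1
            else:
                e += zeros
        return e

    per_person = np.array([errors(r) for r in X])
    return (per_person > 0).mean()

# Hypothetical data: rows 1-4 fit a Guttman pattern, rows 5-6 do not
X = [[1, 1, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 0, 0],
     [1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 1]]
print(guttman_error_stats(X))  # proportion of participants with >= 1 error
```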
Reliability between sites
There was also a great deal of variation in the
exclusion rates between data collection sites. For
example, exclusion rates using the ‘Olson and Fazio
(2001) modified’ criterion ranged from 15% to 74%
across sites. This was quantified using meta-analyses
of the proportion of ‘aware’ participants between sites
for each of the exclusion criteria. Results demonstrated
large between-site heterogeneity (all I2 = 54.7% to
91.7%, all H2 = 2.2 to 12 between the four criteria).
Differences in between-site awareness rates therefore
did not represent mere sampling variation but rather
large between-site heterogeneity. Given that all
measures and instructions were delivered to
participants in a standardized format, this degree of
heterogeneity represents evidence that the awareness
measures may not be as reliable or valid as assumed.
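The heterogeneity statistics can be illustrated with a minimal logit-scale computation of Cochran’s Q, I², and H² over site-level ‘aware’ counts. The site counts below are hypothetical, and the reported analyses used full meta-analytic models, so this is a sketch of the statistics only:

```python
import numpy as np

def heterogeneity(aware, total):
    """Cochran's Q, I^2 (%), and H^2 for site-level proportions (logit scale)."""
    x = np.asarray(aware, dtype=float)
    n = np.asarray(total, dtype=float)
    y = np.log(x / (n - x))        # logit of each site's 'aware' rate
    v = 1.0 / x + 1.0 / (n - x)    # approximate sampling variance of the logit
    w = 1.0 / v                    # inverse-variance weights
    y_fe = np.sum(w * y) / np.sum(w)
    q = float(np.sum(w * (y - y_fe) ** 2))
    df = len(x) - 1
    i2 = 0.0 if q <= df else (q - df) / q * 100  # % variance beyond sampling error
    h2 = q / df
    return q, i2, h2

# Hypothetical sites with 'aware' rates ranging from 15% to 74%
q, i2, h2 = heterogeneity(aware=[15, 40, 60, 74], total=[100, 100, 100, 100])
print(round(i2, 1), round(h2, 1))
```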
This could be attributed to the somewhat
subjective nature of the ‘Olson and Fazio (2001)’
criterion in particular, which (a) asks participants the
broad question of whether they “noticed anything odd
during the experiment”, (b) collects open-ended
responses, and (c) requires these to be hand scored. This
method leaves room for a great degree of variation in
interpretation between participants and sites which
ultimately could lead many ‘aware’ participants to be
scored as ‘unaware’. To take just one example, an
individual who is fully ‘aware’ of the pairings in the
surveillance task might reasonably consider the
stimulus pairings to be unremarkable and not odd at
all, but merely a normal and obvious part of the task,
respond as such, and therefore be incorrectly scored as ‘unaware’.
Figure 1. Forest plot of results.
The preceding two sections suggest that the
awareness criteria demonstrated poor reliability and
structural validity, and therefore likely failed to exclude
participants who were actually aware. In our opinion,
it was this that led to the significant effect in
Moran et al.’s (2020) primary analysis (i.e., its reliance
on the worst of a bad bunch). If we want to conclude
that EC effects can be demonstrated in the absence of
awareness, then a more severe test of the verbal
hypothesis is required.
A severe test of the ‘unaware EC’ hypothesis
With this in mind, we created a stricter exclusion
criterion that favored sensitivity over specificity, and
therefore maximized our chances of excluding ‘aware’
participants. Specifically, we excluded participants if
any of the four criteria scored them as being aware.
This compound criterion excluded 54% of participants
as ‘aware’, leaving 665 in the analytic sample.
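In code, the compound criterion is just a logical OR over the four individual classifications (the flags below are hypothetical):

```python
import numpy as np

# One row per participant; 1 = scored 'aware' under that criterion, in the
# order: Olson & Fazio (2001), O&F modified, Bar-Anan et al. (2010), B-A modified
flags = np.array([
    [0, 0, 0, 0],   # unaware under every criterion -> retained
    [0, 1, 0, 0],   # flagged by a single criterion -> excluded
    [1, 1, 1, 0],   # flagged by several criteria   -> excluded
    [0, 0, 0, 0],   # retained
])

# Compound criterion: exclude if ANY criterion scores the participant 'aware'.
# This prioritizes sensitivity over specificity: some truly unaware
# participants are lost, but fewer aware participants slip through.
aware = flags.any(axis=1)
retained = flags[~aware]
print(int(aware.sum()), "excluded;", len(retained), "retained")
```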
Before fitting a new meta-analysis model, we first
assessed the statistical power of this test given the
available sample size. This ensured that the results of
such a test would be meaningful. Using the same power
analysis method employed by Moran et al. (2020), to
detect an effect size as large as that observed in the
published literature (i.e., g = 0.20) with this sample
size, power was > .99. Stated another way, at power =
.80, the minimum detectable effect size was Cohen’s d
= 0.10. Power estimates were comparable when we
employed what we considered to be a more appropriate
method of power analysis for meta-analysis models (see
Valentine et al., 2010): to detect an effect size of d =
0.20, power was .95. At power = .80, the minimum
detectable effect size was d = 0.16. The available
sample size was therefore concluded to demonstrate
adequate statistical power for our analysis, comparable
to Moran et al. (2020).
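As a rough illustration of the fixed-effect power calculation described by Valentine et al. (2010), the sketch below computes power for a pooled two-group comparison across k equally sized sites. It will not reproduce the exact numbers above, which additionally depend on design details and any random-effects variance; the site and sample figures in the example are assumptions:

```python
from math import sqrt
from statistics import NormalDist

def fixed_effect_meta_power(d, k, n_per_group, alpha=0.05):
    """Approximate power to detect standardized mean difference d when
    pooling k two-group studies of n_per_group participants per condition."""
    nd = NormalDist()
    # Sampling variance of d in one study (large-sample approximation)
    v_study = 2.0 / n_per_group + d ** 2 / (4.0 * n_per_group)
    # Inverse-variance pooling across k equally sized studies
    v_pooled = v_study / k
    z_crit = nd.inv_cdf(1 - alpha / 2)
    lam = d / sqrt(v_pooled)  # noncentrality of the pooled z test
    return (1 - nd.cdf(z_crit - lam)) + nd.cdf(-z_crit - lam)

# E.g., roughly 665 participants spread over 12 sites, two conditions per site
print(round(fixed_effect_meta_power(0.20, k=12, n_per_group=28), 2))
```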
After excluding participants using the compound
criterion, we fitted a meta-analysis model that was
otherwise identical to that employed in Moran et al.’s
(2020) primary analysis. The meta-analyzed EC effect
was a non-significant, well-estimated effect size that
was exceptionally close to zero, Hedges’ g = 0.00, 95%
CI [-0.11, 0.10], p = .983. No heterogeneity was
observed between sites, I2 = 0.0%, H2 = 1.0 (see Figure 1).
A Bayes Factor meta-analysis model using Rouder
and Morey’s (2011) method was also fitted to quantify
the evidence in favor of the null hypothesis. Default JZS
and Cauchy priors were employed to represent a weak
skeptical belief in the null hypothesis (location = 0;
scaling factor r = .707 on fixed effect for condition and
r = 1.0 on random effect for data collection site, see
Rouder & Morey 2011). Strong evidence was found in
favor of the null hypothesis (BF10 = 0.04, effect size δ
= 0.00, 95% HDI [-0.08, 0.07]).
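For intuition about how such a Bayes factor behaves, the sketch below numerically integrates the one-sample JZS Bayes factor of Rouder et al. (2009). This is a simplified analogue, not the hierarchical meta-analytic model with a random site effect actually fitted, and the t statistic in the example is hypothetical:

```python
import numpy as np

def jzs_bf10(t, n, r=0.707):
    """One-sample JZS Bayes factor (Rouder et al., 2009), alternative vs. null.

    Integrates the t likelihood over a Zellner-Siow (Cauchy, scale r) prior
    on effect size, expressed as a scale mixture on g.
    """
    v = n - 1  # degrees of freedom
    g = np.logspace(-6, 4, 20000)  # grid over the mixing parameter g
    # Marginal likelihood under H1: t density rescaled by (1 + n g), weighted
    # by the inverse-gamma(1/2, r^2/2) prior density on g
    integrand = ((1 + n * g) ** -0.5
                 * (1 + t ** 2 / ((1 + n * g) * v)) ** (-(v + 1) / 2)
                 * (r / np.sqrt(2 * np.pi)) * g ** -1.5 * np.exp(-r ** 2 / (2 * g)))
    num = np.sum((integrand[1:] + integrand[:-1]) / 2 * np.diff(g))  # trapezoid rule
    # Likelihood under H0 (effect size delta = 0)
    den = (1 + t ** 2 / v) ** (-(v + 1) / 2)
    return num / den

# A t statistic near zero, as for the compound-criterion effect, favors the null
print(round(jzs_bf10(t=0.02, n=665), 2))
```

With a near-zero t and a large n, the marginal likelihood under the Cauchy prior is much smaller than under the point null, yielding BF10 well below 1.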
Olson and Fazio’s (2001) study and Moran et al.’s
(2020) replication both rely on the successful exclusion
of ‘aware’ participants. However, neither study assessed
the reliability or validity of their awareness criteria.
Our analyses suggest that the criteria are, individually,
relatively poor measures of awareness that likely fail to
exclude ‘aware’ participants. We created a stricter
awareness exclusion criterion that prioritized sensitivity
by combining all four into a compound exclusion
criterion. When subjected to this more severe test,
Moran et al.’s (2020) data does not support the
‘unaware Evaluative Conditioning’ hypothesis.
Results serve to highlight the importance of
distinguishing between a replicable statistical effect and
a replicable inference regarding a verbal hypothesis of
interest (Vazire, 2019; see Yarkoni, 2019), as well as
highlighting the need to pay greater attention to
measurement if our inferences are to be both replicable
and valid. Such calls have been made within other areas
of psychology (see Flake et al., 2017; Flake & Fried,
2019; Hussey & Hughes, 2020), but rarely within
experimental social psychology.
Finally, as coauthors of Moran et al. (2020), we
regret that we did not consider creating this compound
criterion prior to the preregistration of the replication.
Preregistration prior to seeing the results of the primary
tests would have increased the evidential weight of the
current results. However, the concept of evidential
weight is at the core of our critique here: as Moran et
al. (2020) note in their discussion, claims for the
replicability of support for the verbal hypothesis of
‘unaware EC’ have far reaching implications, and such
claims require strong evidence. We feel that the general
trend of evidence, across Moran et al.’s (2020) analyses
and those reported here, is against ‘unaware EC’.
Author contributions. IH conceptualized the study and
analyzed the data. SH provided critical input into the
design and analysis. Both authors wrote the article and
approved the final submitted version of the manuscript.
Declaration of Conflicting Interests. We declare we
have no conflicts of interest with respect to the
research, authorship, and/or publication of this article.
Funding. This research was conducted with the support
of Ghent University grant 01P05517 to IH and
BOF16/MET_V/002 to Jan De Houwer.
Bar-Anan, Y., De Houwer, J., & Nosek, B. A. (2010).
Evaluative conditioning and conscious knowledge
of contingencies: A correlational investigation
with large samples. The Quarterly Journal of
Experimental Psychology, 63(12), 2313–2335.
Flake, J. K., & Fried, E. I. (2019). Measurement
Schmeasurement: Questionable Measurement
Practices and How to Avoid Them. Preprint.
Flake, J. K., Pek, J., & Hehman, E. (2017). Construct
Validation in Social and Personality Research:
Current Practice and Recommendations. Social
Psychological and Personality Science, 8(4), 370–
Gawronski, B., & Walther, E. (2012). What do
memory data tell us about the role of contingency
awareness in evaluative conditioning? Journal of
Experimental Social Psychology, 48(3), 617–623.
Hussey, I., & Hughes, S. (2020). Hidden invalidity
among fifteen commonly used measures in social
and personality psychology. Advances in Methods
and Practices in Psychological Science, In Press.
Jones, C. R., Fazio, R. H., & Olson, M. A. (2009).
Implicit Misattribution as a Mechanism
Underlying Evaluative Conditioning. Journal of
Personality and Social Psychology, 96(5), 933–
Meijer, R. R. (1994). The Number of Guttman Errors
as a Simple and Powerful Person-Fit Statistic.
Applied Psychological Measurement, 18(4), 311–
Moran, T., Hughes, S., Hussey, I., Vadillo, M. A.,
Olson, M. A., Aust, F., Bading, K., Balas, R.,
Benedick, T., Corneille, O., Douglas, S. B.,
Ferguson, M. J., Fritzlen, K. A., Gast, A.,
Gawronski, B., Giménez-Fernández, T., Hanusz,
K., Heycke, T., Högden, F., … De Houwer, J.
(2020). Incidental Attitude Formation via the
Surveillance Task: A Registered Replication
Report of Olson and Fazio (2001). Psychological
Science, (Stage 1 acceptance).
Olson, M. A., & Fazio, R. H. (2001). Implicit Attitude
Formation Through Classical Conditioning.
Psychological Science, 12(5), 413–417.
Rouder, J. N., & Morey, R. D. (2011). A Bayes factor
meta-analysis of Bem’s ESP claim. Psychonomic
Bulletin & Review, 18(4), 682–689.
Shanks, D. R. (2017). Regressive research: The pitfalls
of post hoc data selection in the study of
unconscious mental processes. Psychonomic
Bulletin & Review, 24(3), 752–775.
Vadillo, M. A., Linssen, D., Orgaz, C., Parsons, S., &
Shanks, D. R. (2020). Unconscious or
underpowered? Probabilistic cuing of visual
attention. Journal of Experimental Psychology:
General, 149(1), 160–181.
Valentine, J. C., Pigott, T. D., & Rothstein, H. R.
(2010). How Many Studies Do You Need?: A
Primer on Statistical Power for Meta-Analysis.
Journal of Educational and Behavioral Statistics,
35(2), 215–247.
Vazire, S. (2019). “Thoughts inspired by the @replicats
workshop: Replicability of Evidence asks ‘Would I
get consistent evidence if I did the same thing
again?’ Replicability of Inferences asks ‘Would
others draw the same inference from this evidence
as the claim in the paper?’ (1/5).” [Tweet].
Yarkoni, T. (2019). The Generalizability Crisis.