Article

Making replication mainstream

Authors: Rolf A. Zwaan, Alexander Etz, Richard E. Lucas, and M. Brent Donnellan

Abstract

Many philosophers of science and methodologists have argued that the ability to repeat studies and obtain similar results is an essential component of science. A finding is elevated from a single observation to scientific evidence when the procedures that were used to obtain it can be reproduced and the finding itself can be replicated. Recent replication attempts show that some high-profile results, most notably in psychology but in many other disciplines as well, cannot be replicated consistently. These replication attempts have generated a considerable amount of controversy, and the issue of whether direct replications have value has, in particular, proven to be contentious. However, much of this discussion has occurred in published commentaries and on social media, resulting in a fragmented discourse. To address the need for an integrative summary, we review various types of replication studies and then discuss the most commonly voiced concerns about direct replication. We provide detailed responses to these concerns and consider different statistical ways to evaluate replications. We conclude that there are no theoretical or statistical obstacles to making direct replication a routine aspect of psychological science.
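One of the statistical ways to evaluate replications discussed in this literature is a replication Bayes factor. The sketch below is a minimal normal-approximation version (not the exact procedure from the target article): it contrasts the hypothesis that the replication effect is zero with the hypothesis that it is drawn from the original study's estimate and uncertainty. All numbers are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def replication_bf(d_orig, se_orig, d_rep, se_rep):
    """Normal-approximation replication Bayes factor (BF_r0).

    H0: the true effect is zero.
    Hr: the true effect is distributed as the original study's posterior,
        approximated here by N(d_orig, se_orig**2).
    Returns the marginal likelihood ratio p(d_rep | Hr) / p(d_rep | H0).
    """
    m_r = norm.pdf(d_rep, loc=d_orig, scale=np.sqrt(se_rep**2 + se_orig**2))
    m_0 = norm.pdf(d_rep, loc=0.0, scale=se_rep)
    return m_r / m_0

# Hypothetical example: original d = 0.45 (SE = 0.15), replication d = 0.05 (SE = 0.10)
print(replication_bf(0.45, 0.15, 0.05, 0.10))  # < 1 favours the null over the original effect
```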


... Replicability is considered to provide a metric of robustness within a scientific discipline through direct or conceptual replications (Romero, 2019; Zwaan et al., 2018) and information about the generalizability of a theoretical framework (Irvine, 2021). The inconsistency among these three metrics reiterates the challenge of quantifying replication and poses a conceptual question: what does it truly mean to replicate a study? ...
... These issues are confounded in developmental research on environmental stress as a product of overlapping measures and competing theories (Smith and Pollak, 2020). The nature of economic interests (Mischel, 2008; Romero, 2019), the high rate of psychologists self-reporting engagement in questionable research practices (John et al., 2012), the pressure to publish significant (Romero, 2019; Zwaan et al., 2018) and novel findings (Proulx and Morey, 2021), and 'hindsight bias' (Klapwijk et al., 2021; Zwaan et al., 2018) make it challenging to decipher which specific researcher decisions would reproduce estimates in the data. ...
Article
Full-text available
Increasing evidence demonstrates that environmental factors meaningfully impact the development of the brain (Hyde et al., 2020; McEwen and Akil, 2020). Recent work from the Adolescent Brain Cognitive Development (ABCD) Study® suggests that puberty may indirectly account for some association between the family environment and brain structure and function (Thijssen et al., 2020). However, a limited number of large studies have evaluated what, how, and why environmental factors impact neurodevelopment. When these topics are investigated, there is typically inconsistent operationalization of variables between studies which may be measuring different aspects of the environment and thus different associations in the analytic models. Multiverse analyses (Steegen et al., 2016) are an efficacious technique for investigating the effect of different operationalizations of the same construct on underlying interpretations. While one of the assets of Thijssen et al. (2020) was its large sample from the ABCD data, the authors used an early release that contained 38% of the full ABCD sample. Then, the analyses used several ‘researcher degrees of freedom’ (Gelman and Loken, 2014) to operationalize key independent, mediating and dependent variables, including but not limited to, the use of a latent factor of preadolescents' environment comprised of different subfactors, such as parental monitoring and child-reported family conflict. While latent factors can improve reliability of constructs, the nuances of each subfactor and measure that comprise the environment may be lost, making the latent factors difficult to interpret in the context of individual differences. This study extends the work of Thijssen et al. (2020) by evaluating the extent to which the analytic choices in their study affected their conclusions. In Aim 1, using the same variables and models, we replicate findings from the original study using the full sample in Release 3.0. Then, in Aim 2, using a multiverse analysis we extend findings by considering nine alternative operationalizations of family environment, three of puberty, and five of brain measures (total of 135 models) to evaluate the impact on conclusions from Aim 1. In these results, 90% of the directions of effects and 60% of the p-values (e.g. p > .05 and p < .05) across effects were comparable between the two studies. However, raters agreed that only 60% of the effects had replicated. Across the multiverse analyses, there was a degree of variability in beta estimates across the environmental variables, and lack of consensus between parent reported and child reported pubertal development for the indirect effects. This study demonstrates the challenge in defining which effects replicate, the nuance across environmental variables in the ABCD data, and the lack of consensus across parent and child reported puberty scales in youth.
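To make the multiverse logic concrete, here is a minimal sketch that loops over alternative operationalizations and records the sign and p-value of each specification. The column names are hypothetical and simple linear regressions stand in for the mediation models used in the study.

```python
import itertools
import numpy as np
import pandas as pd
from scipy.stats import linregress

# Hypothetical dataset: several operationalizations of the same constructs
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "env_conflict": rng.normal(size=n),
    "env_monitoring": rng.normal(size=n),
    "brain_thickness": rng.normal(size=n),
    "brain_area": rng.normal(size=n),
})

env_vars = ["env_conflict", "env_monitoring"]    # 9 operationalizations in the actual study
brain_vars = ["brain_thickness", "brain_area"]   # 5 in the actual study

results = []
for env, brain in itertools.product(env_vars, brain_vars):
    fit = linregress(df[env], df[brain])         # one "universe" = one specification
    results.append({"env": env, "brain": brain, "beta": fit.slope, "p": fit.pvalue})

multiverse = pd.DataFrame(results)
print(multiverse.sort_values("beta"))            # inspect how estimates vary across specifications
print((multiverse["p"] < .05).mean())            # share of specifications that are "significant"
```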
... We ask three questions (cf. [33]) that all aim to evaluate to what extent the present study is a successful replication [34] (cf. [35,36]) of our previous work [20]. ...
... Still, the marked difference between the results of the two studies is surprising: the studies are similar enough that the present study [34] could be considered a direct replication of the first study [20]. We used the exact same electrode montage and tDCS parameters, followed the same experimental design (Fig 1), ran the experiment in the same location with the same participant population, and used a virtually identical task. ...
Article
Full-text available
The attentional blink (AB) phenomenon reveals a bottleneck of human information processing: the second of two targets is often missed when they are presented in rapid succession among distractors. In our previous work, we showed that the size of the AB can be changed by applying transcranial direct current stimulation (tDCS) over the left dorsolateral prefrontal cortex (lDLPFC) (London & Slagter, Journal of Cognitive Neuroscience, 33, 756-68, 2021). Although AB size at the group level remained unchanged, the effects of anodal and cathodal tDCS were negatively correlated: if a given individual's AB size decreased from baseline during anodal tDCS, their AB size would increase during cathodal tDCS, and vice versa. Here, we attempted to replicate this finding. We found no group effects of tDCS, as in the original study, but we no longer found a significant negative correlation. We present a series of statistical measures of replication success, all of which confirm that the two studies are not in agreement. First, the correlation here is significantly smaller than a conservative estimate of the original correlation. Second, the difference between the correlations is greater than expected due to sampling error, and our data are more consistent with a zero effect than with the original estimate. Finally, the overall effect when combining both studies is small and not significant. Our findings thus indicate that the effects of lDLPFC-tDCS on the AB are less substantial than observed in our initial study. Although this should be quite a common scenario, null findings can be difficult to interpret and are still under-represented in the brain stimulation and cognitive neuroscience literatures. An important auxiliary goal of this paper is therefore to provide a tutorial for other researchers, to maximize the evidential value from null findings.
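The statistical checks described here (is the replication correlation significantly smaller than the original, is the difference larger than sampling error allows, and is the pooled effect non-zero) can be sketched with standard Fisher z machinery. This is a generic illustration with hypothetical numbers, not the authors' analysis code.

```python
import numpy as np
from scipy.stats import norm

def fisher_z(r):
    return np.arctanh(r)

def compare_correlations(r_orig, n_orig, r_rep, n_rep):
    """Test the difference between two independent correlations on the Fisher z scale."""
    z_diff = fisher_z(r_orig) - fisher_z(r_rep)
    se_diff = np.sqrt(1 / (n_orig - 3) + 1 / (n_rep - 3))
    z = z_diff / se_diff
    return z, 2 * norm.sf(abs(z))                 # two-sided p-value

def pooled_correlation(r_orig, n_orig, r_rep, n_rep):
    """Fixed-effect (inverse-variance) combination of the two correlations."""
    w = np.array([n_orig - 3, n_rep - 3])
    z = np.array([fisher_z(r_orig), fisher_z(r_rep)])
    z_pool = np.sum(w * z) / np.sum(w)
    se_pool = np.sqrt(1 / np.sum(w))
    p = 2 * norm.sf(abs(z_pool / se_pool))
    return np.tanh(z_pool), p

# Hypothetical values: original r = -.45 (n = 34), replication r = -.05 (n = 40)
print(compare_correlations(-0.45, 34, -0.05, 40))
print(pooled_correlation(-0.45, 34, -0.05, 40))
```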
... Social scientists, like their counterparts in more established fields such as chemistry, physics, and biology, strive to uncover predictable regularities about the world. However, psychology, economics, management, and related fields have become embroiled in controversies as to whether the claimed discoveries are reliable (1-11). When reading a research report, is it sensible to assume the finding is a true positive rather than a false positive (12, 13)? ...
... One common counterexplanation for evidence that a scientific finding is not as reliable as initially expected is that it holds in the original context but not in some other contexts, for example due to cultural differences or changes in situations over time (14-19). Taken to the extreme, however, this explanation converts research reports into case studies with little to say about other populations and situations, such that findings and theories are rendered unfalsifiable (11, 20, 21). The multilaboratory replication efforts thus far suggest that experimental laboratory effects either generally hold across samples, including those in different nations, or consistently fail to replicate across sites (22-26). ...
Article
Full-text available
This initiative examined systematically the extent to which a large set of archival research findings generalizes across contexts. We repeated the key analyses for 29 original strategic management effects in the same context (direct reproduction) as well as in 52 novel time periods and geographies; 45% of the direct reproductions returned results matching the original reports, as did 55% of tests in different spans of years and 40% of tests in novel geographies. Some original findings were associated with multiple new tests. Reproducibility was the best predictor of generalizability: for the findings that proved directly reproducible, 84% emerged in other available time periods and 57% emerged in other geographies. Overall, only limited empirical evidence emerged for context sensitivity. In a forecasting survey, independent scientists were able to anticipate which effects would find support in tests in new samples.
... Theoretical discussions of replication offer a variety of functions for replication, which we might group broadly into epistemic and demarcating functions (e.g., Schmidt, 2009;Zwaan et al., 2018). ...
... Rather, we should view replications and experiments as having a similar function. The view I have been resisting casts experiments and replications in different roles: experiments lay bricks in a growing scientific edifice, and replications test the soundness of those bricks (Zwaan et al., 2018). Against this picture, it has been suggested that we should view scientific progress not on the model of building a wall, but of assembling a puzzle (Tullett & Vazire, 2018). ...
Article
Replications are often taken to play both epistemic and demarcating roles in science: they provide evidence about the reliability of fields’ methods and, by extension, about which fields “count” as scientific. I argue that, in a field characterized by a high degree of theoretical openness and uncertainty, like comparative cognition, replications do not sit well in these roles. Like other experiments conducted under conditions of uncertainty, replications are often equivocal and open to interpretation. As a result, they are poorly placed to deliver clear judgments about the reliability of comparative cognition’s methods or its scientific bona fides. I suggest that this should encourage us to take a broader view of both the nature of scientific progress and the role of replication in comparative cognition.
... How similar do they have to be in order to be considered replicated? Studies on replicability typically use the following information or a combination thereof: confidence intervals or Bayesian credibility intervals (Jacob et al., 2019), power (Simonsohn, 2015; Zwaan et al., 2018), the replication Bayes factor (Zwaan et al., 2018), and effect size (Simonsohn, 2015). Despite these various solutions, many problems remain unsolved. ...
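Of the criteria listed in the excerpt above, the power-based "small telescopes" idea (Simonsohn, 2015) asks whether the replication estimate is significantly smaller than the effect the original study had 33% power to detect. The sketch below implements that logic for a two-sample design under noncentral-t assumptions; the numbers and helper names are illustrative, not taken from the cited work.

```python
import numpy as np
from scipy.stats import t as t_dist, nct
from scipy.optimize import brentq

def power_two_sample(d, n_per_group, alpha=0.05):
    """Power of a one-sided two-sample t-test for standardized effect d."""
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)
    crit = t_dist.ppf(1 - alpha, df)
    return nct.sf(crit, df, ncp)

def d33(n_orig_per_group):
    """Effect size the original study had 33% power to detect."""
    return brentq(lambda d: power_two_sample(d, n_orig_per_group) - 1 / 3, 1e-4, 3.0)

def small_telescopes_p(d_rep, n_rep_per_group, n_orig_per_group):
    """One-sided p-value: probability of a replication estimate this small (or smaller)
    if the true effect were d33. A small p suggests the replication is inconsistent
    with an effect the original study could have detected."""
    d_small = d33(n_orig_per_group)
    df = 2 * n_rep_per_group - 2
    ncp = d_small * np.sqrt(n_rep_per_group / 2)
    t_rep = d_rep * np.sqrt(n_rep_per_group / 2)
    return nct.cdf(t_rep, df, ncp), d_small

# Hypothetical: original n = 20 per group; replication d = 0.10 with n = 80 per group
print(small_telescopes_p(0.10, 80, 20))
```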
Preprint
Full-text available
Intensive longitudinal studies typically examine phenomena that vary across time, individuals, contexts, and other boundary conditions. This poses challenges to the conceptualization and identification of replicability and generalizability, which refer to the invariance of research findings across samples and contexts as crucial criteria for trustworthiness. Some of these challenges are specific to intensive longitudinal studies; others are similarly relevant for work with other complex datasets that contain multilayered sources of variation (individuals nested in different types of activities or organizations, regions, countries, etc.). This article opens by discussing the reasons why research findings may fail to replicate. We then analyze reasons why research findings may falsely appear to be non-replicable when they were in fact replicable but lacked generalizability due to heterogeneity between samples, subgroups, individuals, time points, and contexts. Following that, we propose conceptual and methodological approaches to better disentangle non-replicability from non-generalizability and to better understand the exact causes of either problem. In particular, we apply Lakatos's proposition (examining not only whether but under what boundary conditions a theory is a useful description of the world) to the question of whether and under which conditions a research finding is replicable and generalizable. Not only will this contribute to a more systematic understanding of, and research on, replicability and generalizability in longitudinal studies and beyond, but it will also be a contribution to what has been called the heterogeneity revolution (Bryan et al., 2021; Moeller, 2021).
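One standard way to separate "the effect does not replicate" from "the effect is heterogeneous across samples or contexts" is to estimate between-study heterogeneity explicitly. Below is a minimal DerSimonian-Laird sketch with hypothetical effect estimates; it illustrates the general idea rather than the specific approach proposed in this preprint.

```python
import numpy as np

def dersimonian_laird(estimates, variances):
    """Random-effects heterogeneity estimate (tau^2) and pooled effect."""
    y = np.asarray(estimates, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = 1 / v                                    # fixed-effect weights
    mu_fe = np.sum(w * y) / np.sum(w)
    Q = np.sum(w * (y - mu_fe) ** 2)             # Cochran's Q
    k = len(y)
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (Q - (k - 1)) / c)           # between-study variance
    w_re = 1 / (v + tau2)                        # random-effects weights
    mu_re = np.sum(w_re * y) / np.sum(w_re)
    return tau2, mu_re

# Hypothetical site-level effect estimates and their sampling variances
print(dersimonian_laird([0.30, 0.05, 0.42, -0.10], [0.02, 0.03, 0.02, 0.04]))
```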
... Recent years have seen growing concerns regarding the replicability and reproducibility of scientific findings. There have been multiple failures to replicate past findings that were previously considered credible and published in reputable journals (Klein et al., 2014; Open Science Collaboration, 2015), with calls for a science reform that, among other things, encourages more replication work (Zwaan et al., 2018). As such, a replication and extension of the classic outcome bias experiment and findings can update our knowledge about the effect and help gain important insights that would inform future research in the field. ...
Preprint
Full-text available
Outcome bias is the phenomenon whereby decisions that resulted in successful outcomes are rated more favorably than the same decisions when they resulted in failures. In a pre-registered experiment with an online Amazon Mechanical Turk sample (N = 692), we conducted a replication and extension of Experiment 1 from the classic Baron and Hershey (1988), using a between-subjects design. We found support for an outcome bias with stronger effects than in the original, even for participants who stated that outcomes should not be taken into consideration when evaluating decisions. In an extension, we found differences, dependent on outcome type, in evaluations of the perceived importance of considering the outcome, the perceived responsibility of decision makers, and the perception that others would act similarly given the choice. We discuss future directions to elucidate the mechanisms driving outcome bias. Materials, data, and code are available at: https://osf.io/knjhu/.
... Thus, they should be included in future research about pediatric peripheral and neural correlates of self-harm. Such research can be guided by recent innovative work on determining which studies in a body of research should undergo replication [150-152]. ...
Article
Full-text available
Background: Self-harm in children and adolescents is difficult to treat. Peripheral and neural correlates of self-harm could lead to biomarkers to guide precision care. We therefore conducted a scoping review of research on peripheral and neural correlates of self-harm in this age group. Methods: PubMed and Embase databases were searched from January 1980 to May 2020 for English-language peer-reviewed studies about peripheral and neural correlates of self-harm, defined as completed suicide, suicide attempts, suicidal ideation, or non-suicidal self-injury (NSSI), in subjects from birth to 19 years of age. Studies were excluded if they investigated self-harm only in persons with intellectual or developmental disability syndromes. A blinded multi-stage assessment process by pairs of co-authors selected the final studies for review. Risk-of-bias estimates were made for the final studies. Results: We screened 5537 unduplicated abstracts, leading to the identification of 79 eligible studies in 76 papers. Of these, 48 investigated peripheral correlates and 31 examined neural correlates. Suicidality was the focus in two thirds of the studies, with NSSI and any type of self-harm (subjects recruited with suicidality, NSSI, or both) investigated in the remaining studies. All studies used observational designs (primarily case-control), and most used convenience samples of adolescent patients, which were predominantly female and half of which were recruited on the basis of a disorder. Over a quarter of the specific correlates were investigated in only one study. Inter-study agreement on findings for correlates examined in more than one study was often low. A risk-of-bias rating of Good was assigned to 37% of the studies, and the majority were rated as Fair. Conclusions: Research on peripheral and neural correlates of self-harm is not sufficiently mature to identify potential biomarkers. Conflicting findings were reported for many of the correlates studied. Methodological problems may have produced biased findings, and the results are mainly generalizable to patients and girls. We provide recommendations to improve future peripheral and neural correlate research in children and adolescents, ages 3-19 years, with self-harm.
... We aimed to revisit the phenomenon to examine the reproducibility and replicability of the findings. Following the recent and growing recognition of the importance of reproducibility and replicability in psychological science (e.g., Brandt et al., 2014; Open Science Collaboration, 2015; van't Veer & Giner-Sorolla, 2016; Zwaan et al., 2018), we embarked on a well-powered pre-registered close replication of Bastian et al. (2012). Bastian et al. (2012) tested and found support for a number of hypotheses derived from their account of the meat paradox. ...
Preprint
Full-text available
[IMPORTANT: Method and results were written using a randomized dataset produced by Qualtrics to simulate what these sections will look like after data collection. These will be updated following the data collection. For the purpose of the simulation, we wrote things in past tense, but no pre-registration or data collection took place yet.] Bastian et al. (2012) argued that the ‘meat paradox’–caring for animals yet eating them–is maintained by motivated moral disengagement driven by a psychologically aversive tension between people’s moral standards (caring for animals) and their behavior (eating them). One disengagement mechanism that is thought to play a central role is the denial of food animal minds, and therefore their status as moral patients. This idea has garnered substantial interest and has framed much of the psychological approach to meat consumption. We propose to subject Studies 1 and 2 of Bastian et al. (2012) to high-powered direct replications. For Study 1, our replication [failed to find/found] support for the original findings: perceptions of animals’ minds were negatively related to their perceived edibility, and positively related to moral concern for them and negative affect related to eating them, [summary effect sizes + CIs will be added here]. For Study 2, our replication [failed to find/found] an effect of learning that animals will be used for food on the tendency to deny them mental capabilities. Overall, our findings [matched/did not match] with the original’s, and we [found/failed to find] support for the relationship between animal mind denial and perceptions of their status as sources of food. Materials, data, and code are available on the OSF: https://osf.io/h2pqu/.
... This problematic situation is likely to improve, given recent developments and suggestions made by the open science movement (Zwaan, Etz, Lucas, & Donnellan, 2018). We note that Marr's algorithmic level does not necessarily involve literal algorithms. ...
Article
Full-text available
A recent trend in psycholinguistic research has been to posit prediction as an essential function of language processing. The present paper develops a linguistic perspective on viewing prediction in terms of pre-activation. We describe what predictions are and how they are produced. Our basic premises are that (a) no prediction can be made without knowledge to support it; and (b) it is therefore necessary to characterize the precise form of that knowledge, as revealed by a suitable theory of linguistic representations. We describe the Parallel Architecture (PA: Jackendoff, 2002; Jackendoff & Audring, 2020), which makes explicit our commitments about linguistic representations, and we develop an account of processing based on these representations. Crucial to our account is that what have been traditionally treated as derivational rules of grammar are formalized by the PA as lexical items, encoded in the same format as words. We then present a theory of prediction in these terms: linguistic input activates lexical items whose beginning (or incipit) corresponds to the input encountered so far; and prediction amounts to pre-activation of the as yet unheard parts of those lexical items (the remainder). Thus the generation of predictions is a natural byproduct of processing linguistic representations. We conclude that the PA perspective on pre-activation provides a plausible account of prediction in language processing that bridges linguistic and psycholinguistic theorizing.
... Initially, the hypotheses were established in a larger sample of students and, subsequently, replication tests were done in the community sample. This is an ecologically valid testing strategy that renders robust effects [72]. To be precise, the following hypotheses were tested in the present study: (1) IU mediates the effect of lower family income on religiousness; (2) IU mediates the effect of lower caste status on religiousness, and; (3) IU mediates the effect of lower education on religiousness. ...
Article
Full-text available
The relationship between lower socioeconomic status (SES) and religiousness is well known; however, its (psychological mediation) mechanism is not clear. In the present study, we studied the mediating role of intolerance of uncertainty (IU; a personality measure of self-uncertainty) in the effect of SES on religiousness and its dimensions (i.e., believing, bonding, behaving, and belonging) in two different samples (student sample, N = 868, and community sample, N = 250), after controlling for the effects of factors such as age, sex, handedness, and self-reported risk-taking. The results showed that IU mediated the effects of lower family income and lower caste status (in the student sample only) on religiousness and its dimensions; higher caste status had a direct effect on religiousness (and its dimensions); and, among the sub-factors of IU, only prospective IU affected religiousness. Thus, along with showing that IU is a mediator of the effects of lower family income and lower caste status on religiousness, the present study supports the contention that religiousness is a latent variable that varied factors can independently initiate. Moreover, the present study suggests a nuanced model of the relationship between the hierarchical caste system and religiousness.
... Thus, how reliance on misinformation changes across a delay may depend on the actual information that is included in the retraction. The finding that providing a retraction is equally effective after no delay and a two-day delay should be replicated (for the importance of replication, see Camerer et al., 2018; Pashler & Harris, 2012; Roediger III, 2012; Simons, 2014; Zwaan et al., 2018), but it is nonetheless encouraging, as it provides initial evidence that the effectiveness of a retraction for reducing reliance on misinformation can persist over time. Continuing to explore how memory and inferential reasoning are impacted across various delays will yield important information about the lasting impact of the CIE. ...
Article
Research suggests that exposure to misinformation continues to impact belief and reasoning even if that misinformation has been corrected (referred to as the Continued Influence Effect, CIE). The present experiment explores two potentially important factors that may impact the effect: (1) learner age and (2) length of delay between retraction and final test. During initial learning, participants (both young and older adults) read six scenarios in which a critical piece of misinformation was either retracted or not retracted. Following no delay, a short (ten minutes) delay, or a long (two days) delay, participants then answered inferential reasoning questions about the previously studied scenarios to evaluate how (if at all) the prior retraction impacts reliance on misinformation. Outcomes help us to understand the ways in which misinformation (even following retraction) impacts reasoning, an issue of exceeding importance as the proliferation of fake news shows no signs of slowing.
... This astonishing reach of Gestalt theory into an immense number of fields within psychology, and possibly even beyond psychology, is probably one of its biggest strengths. And given the current replication crisis in the empirical, social, and behavioral sciences and the lack of coherent findings that do replicate (see Zwaan et al., 2018; also see the Witte & Zenker comment in Zwaan et al., 2018), something probably has to change in ways that are more than palliative solutions. And most important, empirical psychology has to value the phenomenal, experiential component in its search. (In fact, Gestalt theory used many examples from music to exemplify its proposals.) ...
... We therefore aimed to revisit the classic phenomenon to examine the reproducibility and replicability of the findings with independent replications. We followed the recent growing recognition of the importance of reproducibility and replicability in psychological science (e.g., Brandt et al., 2014; Open Science Collaboration, 2015; van't Veer & Giner-Sorolla, 2016; Zwaan et al., 2018) and embarked on a well-powered pre-registered very close replication of Peters et al. (2006). Peters et al. (2006) conducted four studies, and we aimed to replicate all of them, with the needed adjustments, in a single data collection, with the experiments displayed in a random order (more on that in the methods section). ...
Preprint
Full-text available
[IMPORTANT: Results were written in past tense using a randomized dataset produced by Qualtrics to simulate what these sections will look like after data collection. These will be updated following the data collection.] Numeracy is individuals’ capacity to understand and process basic probability and numerical information required to make decisions. We conducted a pre-registered replication and extension of Peters et al. (2006) examining associations between numeracy and positive-negative framing (Experiment 1), frequency-percentage framing (Experiment 2), ratio effect (Experiment 3), and loss vs. no-loss (Experiment 4). We collected data with an online US American Amazon Mechanical Turk sample (N =850). Our replication [failed to find/found] support for the original findings regarding associations between numeracy and four decision-making effects: [summary effect sizes+CIs will be added here]. Extending the replication, we [found/failed to find] support for an association between numeracy and confidence [summary effect sizes+CIs will be added here]. Materials, data, and code are available on: https://osf.io/4hjck/.
... The AIC and AICc statistics include complexity penalties in such a way that (in ideal situations) the preferred model will better predict new replication data. Given the importance of replication in scientific investigations (Earp & Trafimow, 2015; Zwaan et al., 2018), these statistics seem like they would be of interest to many scientists. One advantage of this approach compared to frequentist hypothesis testing is that it can provide support for the null model, compared to the alternative model. ...
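For reference, with k estimated parameters, n observations, and maximized likelihood $\hat{L}$, the standard definitions behind these penalties are

$$\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{AIC}_c = \mathrm{AIC} + \frac{2k(k+1)}{n - k - 1},$$

so the small-sample correction term shrinks as n grows and penalizes additional parameters more heavily when n is small.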
Article
Recent insights into problems with common statistical practice in psychology have motivated scientists to consider alternatives to the traditional frequentist approach that compares p-values to a significance criterion. While these alternatives have worthwhile attributes, Francis (Behavior Research Methods, 40, 1524-1538, 2017) showed that many proposed test statistics for the situation of a two-sample t-test are based on precisely the same information in a given data set; and for a given sample size, one can convert from any statistic to the others. Here, we show that the same relationship holds for the equivalent of a one-sample t-test. We derive the relationships and provide an on-line app that performs the computations. A key conclusion of this analysis is that many types of tests are based on the same information, so the choice of which approach to use should reflect the intent of the scientist and the appropriateness of the corresponding inferential framework for that intent.
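The point that several statistics carry the same information can be illustrated for the one-sample case: from t and n alone one can recover the p-value, Cohen's d, an approximate Bayes factor, and an AIC difference. The sketch below uses the BIC approximation to the Bayes factor (Wagenmakers, 2007) rather than the specific statistics compared by Francis, and all numbers are hypothetical.

```python
import numpy as np
from scipy.stats import t as t_dist

def one_sample_summaries(t, n):
    """Statistics recoverable from a one-sample t statistic and sample size n."""
    df = n - 1
    p_two_sided = 2 * t_dist.sf(abs(t), df)
    cohens_d = t / np.sqrt(n)
    # BIC approximation to the Bayes factor for H0 (mu = 0) vs H1 (mu free):
    # BF01 ~= sqrt(n) * (1 + t^2 / (n - 1)) ** (-n / 2)
    bf01 = np.sqrt(n) * (1 + t**2 / df) ** (-n / 2)
    # AIC difference between the null and the alternative model (positive favours H1)
    delta_aic = n * np.log(1 + t**2 / df) - 2
    return {"p": p_two_sided, "d": cohens_d, "BF01": bf01, "dAIC": delta_aic}

print(one_sample_summaries(t=2.5, n=30))   # hypothetical values
```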
... These studies are interesting in themselves, as they demonstrate an influence of masculine generics on specific aspects of cognition with different paradigms and measures, but they do not constitute replications of the original study. Successful replications, however, particularly replications of effects that were originally found in underpowered studies, are needed to elevate an effect from a single observation to scientific evidence (Zwaan et al., 2018; see also Cesario, 2014). ...
Article
Full-text available
The use of masculine generics (i.e., grammatically masculine forms that refer to both men and women) is prevalent in many languages but has been criticized for potentially triggering male bias. Empirical evidence for this claim exists but is often based on small and selective samples. This study is a high-powered and pre-registered replication and extension of a 20-year-old study on this biasing effect in German speakers. Under 1 of 4 conditions (masculine generics vs. three gender-inclusive alternatives), 344 participants listed 3 persons of 6 popular occupational categories (e.g., athletes, politicians). Despite 20 years of societal changes, results were remarkably similar, underscoring the high degree of automaticity involved in language comprehension (large effects of 0.71 to 1.12 of a standard deviation). Male bias tended to be particularly pronounced later rather than early in retrieval, suggesting that salient female exemplars may be recalled first but that male exemplars still dominate the overall categorical representations.
... B. Klein, 2014; Stroebe & Strack, 2014). (Given this spotlighting of theory, Zwaan et al. [2018] suggest that the term "conceptual" is a misnomer and that a more appropriate designation would be "extension" to refer to testing and extending theory.) Advocates of conceptual replication stress the cumulative nature of science not as amassing countless empirical findings but as progressing through theory development, refinement, and sometimes replacement. ...
Article
Full-text available
Although psychology’s recent crisis has been attributed to various scientific practices, it has come to be called a “replication crisis,” prompting extensive appraisals of this putatively crucial scientific practice. These have yielded disagreements over what kind of replication is to be preferred and what phenomena are being explored, yet the proposals are all grounded in a conventional philosophy of science. This article proposes another avenue that invites moving beyond a discovery metaphor of science to rethink research as enabling realities and to consider how empirical findings enact or perform a reality. An enactment perspective appreciates multiple, dynamic realities and science as producing different entities, enactments that ever encounter differences, uncertainties, and precariousness. The axioms of an enactment perspective are described and employed to more fully understand the two kinds of replication that predominate in the crisis disputes. Although the enactment perspective described here is a relatively recent development in philosophy of science and science studies, some of its core axioms are not new to psychology, and the article concludes by revisiting psychologists’ previous calls to apprehend the dynamism of psychological reality to appreciate how scientific practices actively and unavoidably participate in performativity of reality.
... Replications are an integral part of cumulative experimental science (e.g., Campbell, 1969;Rosenthal, 1990;Zwaan et al., 2018). Yet many scientific disciplines do not replicate enough. ...
Preprint
Replications are an integral part of cumulative experimental science. Yet many scientific disciplines do not replicate enough because novel confirmatory findings are valued over direct replications. To provide a systematic assessment of the replication landscape in experimental linguistics, the present study estimated replication rates for over 50,000 articles across 98 journals. We used automatic string matching in the Web of Science combined with in-depth manual inspection of 210 papers. The median rate of mentioning the search string "replicat*" was as low as 1.6%. Subsequent manual analyses revealed that only eight of these were direct replications, i.e., studies that arrive at the same scientific conclusions as an initial study by using exactly the same methodology. Moreover, only 1 in 1600 experimental linguistics studies reports a direct replication performed by independent researchers. We conclude that, similar to neighboring disciplines, experimental linguistics does not replicate enough.
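The first, automated step of such an estimate (before manual inspection) amounts to counting how many records mention a replication-related term. A minimal sketch with made-up records:

```python
import re

records = [
    {"id": 1, "abstract": "We present a direct replication of Smith (2010)."},
    {"id": 2, "abstract": "A novel eye-tracking study of sentence processing."},
    {"id": 3, "abstract": "We failed to replicate the original priming effect."},
]

pattern = re.compile(r"replicat", re.IGNORECASE)   # matches replicate, replication, replicated, ...

hits = [r for r in records if pattern.search(r["abstract"])]
rate = len(hits) / len(records)
print(f"{len(hits)} of {len(records)} records ({rate:.1%}) mention a replication term")
# Flagged records would then be inspected manually to check whether they report direct replications.
```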
... Replications are important in experimental studies, but in order to ensure robustness it is also important that the causal mechanisms can be tested and identified with different survey-related tools. The empirical findings from an experimental set-up should be independent of rating scales and question wording (Feest 2019; Zwaan et al. 2018). ...
Article
Full-text available
Despite the continued electoral progress of the radical right, there are reasons to believe that its full electoral potential has yet to be revealed. Previous research suggests that it suffers from a stigmatisation effect and that many voters will find its proposals less compelling compared to if they were presented by a mainstream party even for policy issues they agree upon. This study employs a unique survey design, with two experiments conducted seven years apart, on a panel of Swedish voters. The aim is to evaluate whether proposals are assessed differently dependent on who the sender is and whether the effect diminishes as the cordon sanitaire of the party weakens. The results show that proposals are less liked if the sender is the radical right. This effect persists even after a weakening of the ostracisation of the radical right as well as for different types of political issues. Supplemental data for this article can be accessed online at: https://doi.org/10.1080/01402382.2021.2019977 .
... Replication of past findings has also been done sparingly, given that replication studies are traditionally harder to publish than original findings. To date, however, more and more journals (e.g., Journal of Experimental Psychology: General) and funding agencies (e.g., the Netherlands Organisation for Scientific Research) call for such studies, giving hope that this will be a way to reduce QRPs (Zwaan, Etz, Lucas, & Donnellan, 2018). ...
Chapter
Full-text available
Questionable research practices (QRPs), such as p-hacking (i.e., the inappropriate manipulation of data analysis to find statistical significance) and post hoc hypothesizing, are threats to the replicability of research findings. One key solution to the problem of QRPs is preregistration. This refers to time-stamped documentation that describes the methodology and statistical analyses of a study before the data are collected or inspected. As such, readers of the study's report can evaluate whether the described research is in line with the planned methods and analyses or whether there are deviations from these (e.g., analyses performed so that the research hypotheses are confirmed). Here, we aim to describe what preregistration entails and why it is useful for psychology research. In this vein, we present the key elements of a sufficient preregistration file, its advantages as well as its disadvantages, and why preregistration is a key, yet partially insufficient, solution against QRPs. By the end of this chapter, we hope that readers are convinced that there is little reason not to preregister their research. Keywords: Questionable research practices; Preregistration; Psychological science; Clinical science
... The last decade has also seen substantial discussion around the need to invest in and reward replication studies (Koole & Lakens, 2012;Zwaan, Etz, Lucas, & Donnellan, 2018). Direct replications form a vital part of the evidence base in establishing the robustness of findings, but they can be challenging for authors to publish and are not often published in the highest-profile journals. ...
... Closely related to the prior points, another key recommendation for future work is to carry out well-powered direct replications of previous findings Zwaan et al., 2018). To date, the only published direct replication of a neurostimulation study of embodied language comprehension is Gianelli and Dalla Volta's (2015) study. ...
Article
Full-text available
According to the embodied cognition view, comprehending action-related language requires the participation of sensorimotor processes. A now sizeable literature has tested this proposal by stimulating (with TMS or tDCS) motor brain areas during the comprehension of action language. To assess the evidential value of this body of research, we exhaustively searched the literature and submitted the relevant studies (N = 43) to p-curve analysis. While most published studies concluded in support of the embodiment hypothesis, our results suggest that we cannot yet assert beyond reasonable doubt that they explore real effects. We also found that these studies are quite underpowered (estimated power < 30%), which means that a large percentage of them would not replicate if repeated identically. Additional tests for excess significance show signs of publication bias within this literature. In sum, extant brain stimulation studies testing the grounding of action language in the motor cortex do not stand on solid ground. We provide recommendations that will be important for future research on this topic.
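As a rough illustration of what a p-curve asks, the binomial version of the right-skew test checks whether significant p-values cluster below .025 (as expected if studies track real effects) rather than spreading evenly over the (0, .05) interval. This is a simplified sketch with hypothetical p-values, not the full pp-value/Stouffer procedure used in published p-curve analyses.

```python
from scipy.stats import binomtest   # scipy >= 1.7

# Hypothetical significant p-values harvested from a literature
p_values = [0.001, 0.004, 0.012, 0.028, 0.031, 0.044, 0.046, 0.049]

significant = [p for p in p_values if p < 0.05]
low_half = sum(p < 0.025 for p in significant)

# If the studied effects were all null, significant p-values would be uniform on (0, .05),
# so about half should fall below .025; evidential value predicts more than half.
result = binomtest(low_half, n=len(significant), p=0.5, alternative="greater")
print(low_half, len(significant), result.pvalue)
```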
... At first glance, replications may appear to garner greater appreciation in social psychology today. For example, different systematic replication efforts (e.g., the "Many Labs" projects; Ebersole et al., 2016; Klein et al., 2014) have been conducted and widely discussed (Hüffmeier et al., 2016; Zwaan et al., 2018). ...
Article
Full-text available
A decade ago, replications were typically neither conducted nor appreciated in social psychology, although replications play a central role in ensuring trust in scientific fields. Without systematic replication efforts, it is not clear whether findings are trustworthy. As journals can function as gatekeepers for publications, they can influence whether researchers conduct (and publish) replications. Yet the scholarly culture in social psychology might have changed over the last decade because numerous highly visible studies did not replicate past findings. In light of these insights and the resulting learning opportunities for the field, we predicted an increase in the expressed support for replications in the policies of social psychology journals from 2015 (i.e., the year the replication problem became widely known) to 2022. We coded whether and how replications were mentioned in the author guidelines on the websites of social psychology journals (N = 51). As expected, replications were welcomed more often in 2022 (25%) than in 2015 (12%), but they were not mentioned on the websites of the majority of journals (71% in 2022 vs. 82% in 2015). An exploratory analysis suggested that journals that expressed support for replications on their websites were also more likely to publish articles about replication. Further, exploratory analyses of the journals' TOP factors indicated similar rates of support for replications as for other rigor- and transparency-promoting policies. In sum, our findings suggest that appreciation for replication has increased, but is not yet part of mainstream culture in social psychology.
... Thus, there is evidence that enduring underlying criteria influence people's sequential partner selection. However, one cornerstone of psychological research is replication (e.g., Zwaan et al., 2017). In addition, the previous work on physical similarities in sequential partner selection did not directly account for ethnicity, and did not always account for possible mis-remembering in the self-reported data used. ...
Article
Full-text available
Studies have indicated that people are attracted to partners who resemble themselves or their parents, in terms of physical traits including eye color. We might anticipate this inclination to be relatively stable, giving rise to a sequential selection of similar partners who then represent an individual’s “type”. We tested this idea by examining whether people’s sequential partners resembled each other at the level of eye color. We gathered details of the eye colors of the partners of participants (N = 579) across their adult romantic history (N = 3250 relationships), in three samples, comprising two samples which made use of self-reports from predominantly UK-based participants, and one which made use of publicly available information about celebrity relationship histories. Recorded partner eye colors comprised black (N = 39 partners), dark brown (N = 884), light brown (N = 393), hazel (N = 224), blue (N = 936), blue green (N = 245), grey (N = 34), and green (N = 229). We calculated the proportion of identical eye colors within each participant’s relationship history, and compared that to 100,000 random permutations of our dataset, using t-tests to investigate if the eye color of partners across an individual’s relationship history was biased relative to chance (i.e., if there was greater consistency, represented by higher calculated proportions of identical eye colors, in the original dataset than in the permutations). To account for possible eye color reporting errors and ethnic group matching, we ran the analyses restricted to White participants and to high-confidence eye color data; we then ran the analyses again in relation to the complete dataset. We found some limited evidence for some consistency of eye color across people’s relationship histories in some of the samples only when using the complete dataset. We discuss the issues of small effect sizes, partner-report bias, and ethnic group matching in investigating partner consistency across time.
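The permutation logic described here (compare the observed within-person consistency of partner eye color with the consistency obtained after shuffling partners across participants) can be sketched as follows. The data and the consistency measure are simplified stand-ins for the study's actual coding.

```python
import random
from itertools import combinations
import numpy as np

random.seed(1)

# Hypothetical data: each participant's partners' eye colors across their relationship history
histories = [
    ["blue", "blue", "brown"],
    ["brown", "brown"],
    ["green", "blue", "green", "green"],
    ["hazel", "brown", "brown"],
]

def consistency(hists):
    """Mean proportion of identical-color partner pairs within each relationship history."""
    props = []
    for h in hists:
        pairs = list(combinations(h, 2))
        props.append(sum(a == b for a, b in pairs) / len(pairs))
    return float(np.mean(props))

observed = consistency(histories)

# Null distribution: shuffle all partners across participants while keeping history lengths
all_partners = [c for h in histories for c in h]
lengths = [len(h) for h in histories]
null = []
for _ in range(10_000):                          # the study used 100,000 permutations
    random.shuffle(all_partners)
    shuffled, i = [], 0
    for length in lengths:
        shuffled.append(all_partners[i:i + length])
        i += length
    null.append(consistency(shuffled))

p = float(np.mean(np.array(null) >= observed))   # one-sided: more consistency than chance
print(observed, p)
```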
... The Replication Crisis, according to which many findings in the social, behavioral, and biomedical sciences have failed to replicate at alarming rates, has led to many interesting developments when it comes to understanding what a replication is and what purpose it can serve. I will briefly present a typology of replication adapted from Schmidt and Oh (2016), Schmidt (2016), Zwaan et al. (2018) and Fletcher (2021) before going back to our current issue. The typology I suggest breaks replication down along four categories. ...
Article
Measuring the rate at which the universe expands at a given time (the "Hubble constant") has been a topic of controversy since the first measure of its expansion by Edwin Hubble in the 1920s. As early as the 1970s, Sandage and de Vaucouleurs were arguing about the adequate methodology for such a measurement. Should astronomers focus only on their best indicators, e.g., the Cepheids, and improve the precision of this measurement, based on a unique object, to the best possible level? Or should they "spread the risks", i.e., multiply the indicators and methodologies before averaging over their results? Is a robust agreement across several uncertain measures, as is currently argued to defend the existence of a "Hubble crisis", more telling than a single 1% precision measurement? This controversy, I argue, stems from a misconception of what managing the uncertainties associated with such experimental measurements requires. Astrophysical measurements, such as the measure of the Hubble constant, require a methodology that makes it possible both to reduce the known uncertainties and to track the unknown unknowns. Based on the lessons drawn from the so-called Hubble crisis, I sketch a methodological guide for identifying, quantifying, and reducing uncertainties in astrophysical measurements, hoping that such a guide can not only help to reframe the current Hubble tension, but also serve as a starting point for future fruitful discussions between astrophysicists, astronomers, and philosophers.
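To see numerically what "agreement across uncertain measures" versus "a single precise measurement" amounts to, the tension between two independent estimates is usually quoted in units of their combined uncertainty. A toy calculation with approximate published values (illustrative only):

```python
import numpy as np

# Approximate published values (km/s/Mpc): local distance-ladder vs. CMB-inferred
h0_ladder, sigma_ladder = 73.0, 1.0
h0_cmb, sigma_cmb = 67.4, 0.5

tension = abs(h0_ladder - h0_cmb) / np.sqrt(sigma_ladder**2 + sigma_cmb**2)
print(f"{tension:.1f} sigma")   # roughly 5 sigma if both error budgets are taken at face value

# Inverse-variance average, meaningful only if the measurements share no unmodelled systematics
w = np.array([1 / sigma_ladder**2, 1 / sigma_cmb**2])
h0_combined = np.sum(w * np.array([h0_ladder, h0_cmb])) / np.sum(w)
print(f"{h0_combined:.1f} +/- {np.sqrt(1 / np.sum(w)):.1f}")
```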
... but are in fact reliable. In other words, the replication strongly indicates that the reported effects are not sample-specific but seem to exist in the underlying population too (see Zwaan et al., 2018). However, the replication does not provide further evidence for the generalisability of the results in the sense that it expands on the findings of Hilkenmeier et al., for instance by using different procedures or operationalisations. ...
Chapter
A traditional approach to testing complex relationships between different unobserved constructs included in theoretical models is to apply covariance-based structural equation modelling (CB-SEM). This chapter aims at introducing an alternative approach to estimating structural equation models that has not yet been widely used in research on professional learning and development or in research on learning in general: partial least squares structural equation modelling (PLS-SEM). PLS-SEM is based on ordinary least squares regression analysis and uses an iterative algorithm to find parameter estimates. This estimation approach has several advantages, including fewer statistical assumptions. In addition, PLS-SEM allows for the incorporation of both lower-order and higher-order formative constructs as well as for estimating rather complex models, which is not always possible with CB-SEM. The conceptual explanation of this particular SEM technique will be illustrated using a replication of a published research study, focussing on the influence of learner factors and learning context on different professional learning activities. Keywords: PLS; Partial least squares; SEM; Informal learning; Workplace learning; Professional development; Learning culture
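To make the chapter's description of the iterative, regression-based estimation concrete, here is a deliberately minimal sketch for a single structural path (two reflective constructs, Mode A outer weights, centroid inner scheme). The data and variable names are hypothetical, and real analyses would use dedicated PLS-SEM software rather than this toy loop.

```python
import numpy as np

def pls_single_path(X_exo, X_endo, tol=1e-8, max_iter=500):
    """Minimal PLS path estimation: one exogenous and one endogenous construct,
    reflective (Mode A) measurement, centroid inner weighting scheme."""
    def standardize(M):
        return (M - M.mean(axis=0)) / M.std(axis=0, ddof=1)

    X1 = standardize(np.asarray(X_exo, float))
    X2 = standardize(np.asarray(X_endo, float))
    w1 = np.ones(X1.shape[1]) / np.sqrt(X1.shape[1])
    w2 = np.ones(X2.shape[1]) / np.sqrt(X2.shape[1])

    for _ in range(max_iter):
        # Outer approximation: construct scores as weighted sums of their indicators
        y1 = standardize(X1 @ w1)
        y2 = standardize(X2 @ w2)
        # Inner approximation (centroid): each construct is proxied by its neighbour,
        # signed by the correlation between the two score vectors
        s = np.sign(np.corrcoef(y1, y2)[0, 1])
        z1, z2 = s * y2, s * y1
        # Mode A update: outer weights = covariances of indicators with the inner proxy
        w1_new = X1.T @ z1 / len(z1)
        w2_new = X2.T @ z2 / len(z2)
        w1_new /= np.linalg.norm(w1_new)
        w2_new /= np.linalg.norm(w2_new)
        converged = max(np.abs(w1_new - w1).max(), np.abs(w2_new - w2).max()) < tol
        w1, w2 = w1_new, w2_new
        if converged:
            break

    y1, y2 = standardize(X1 @ w1), standardize(X2 @ w2)
    path = float(np.corrcoef(y1, y2)[0, 1])   # OLS slope between standardized scores
    return path, w1, w2

# Hypothetical data: 4 indicators for the exogenous construct, 3 for the endogenous one
rng = np.random.default_rng(42)
latent = rng.normal(size=200)
X_exo = latent[:, None] + rng.normal(size=(200, 4))
X_endo = 0.5 * latent[:, None] + rng.normal(size=(200, 3))
print(pls_single_path(X_exo, X_endo))
```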
... Finally, another unexplored topic regards the possibility of modulating transcallosal inhibition, which would represent evidence supporting the physiological nature of the M1-P15 and thus a further validation of this early TEP component. Therefore, we designed a conceptual replication study (Zwaan et al., 2018) in which we aimed to replicate evidence validating the M1-P15 as a physiological index of interhemispheric connectivity and its role in bimanual coordination, under experimental circumstances that could overcome some limitations of the original study. ...
Article
Full-text available
In a recently published study combining transcranial magnetic stimulation and electroencephalography (TMS-EEG), an early component of TMS-evoked potentials (TEPs), i.e., M1-P15, was proposed as a measure of transcallosal inhibition between motor cortices. Given that early TEPs are known to be highly variable, further evidence is needed before M1-P15 can be considered a reliable index of effective connectivity. Here, we conceived a new preregistered TMS-EEG study with two aims. The first aim was validating the M1-P15 as a cortical index of transcallosal inhibition by replicating previous findings on its relationship with the ipsilateral silent period (iSP) and with performance in bimanual coordination. The second aim was inducing a task-dependent modulation of transcallosal inhibition. A new sample of 32 healthy right-handed participants underwent behavioral motor tasks and TMS-EEG recording, in which left and right M1 were stimulated both during bimanual tasks and during an iSP paradigm. Hypotheses and methods were preregistered before data collection. Results show a replication of our previous findings on the positive relationship between M1-P15 amplitude and the iSP normalized area. In contrast, the relationship between M1-P15 latency and bimanual coordination was not confirmed. Finally, M1-P15 amplitude was modulated by the characteristics of the bimanual task the participants were performing, and not by the contralateral hand activity during the iSP paradigm. In sum, the present results corroborate our previous findings in validating the M1-P15 as a cortical marker of transcallosal inhibition and provide novel evidence of its task-dependent modulation. Importantly, we demonstrate the feasibility of preregistration in the TMS-EEG field to increase methodological rigor and transparency.
... While these neuroscientists clearly value replication, their 11 reasons for these failures mostly address differences between the experimental components of the original study and its attempted replication. While concerns over the requisite degree of "exactness" or "directness" for replications are common in recent literature (Schmidt 2009;Zwaan et al. 2018;Feest 2019;Romero 2019), and study differences are often noted by practicing scientists, there remains a question that has received little explicit discussion: under what circumstances should researchers rest content with a mismatch explanation for reported failures to replicate? ...
Article
Full-text available
Scientists often respond to failures to replicate by citing differences between the experimental components of an original study and those of its attempted replication. In this paper, we investigate these purported mismatch explanations. We assess a body of failures to replicate in neuroscience studies on spinal cord injury. We argue that a defensible mismatch explanation is one where (1) a mismatch of components is a difference maker for a mismatch of outcomes, and (2) the components are relevantly different in the follow-up study, given the scope of the original study. With this account, we argue that not all differences between studies are meaningful, even if they are difference makers. As our examples show, focusing only on these differences results in disregarding the representativeness of the original experiment’s components and the scope of its outcomes, undercutting other epistemic aims, such as translation, in the process.
... Although we believe the findings reported in the original study (van den Berg et al., 2020) have the potential to contribute to both theory and treatment, there is a need for replication studies in general. Replication is a crucial cornerstone of science; for something to be counted as a scientific discovery, a finding needs to be repeatable (Zwaan et al., 2017). Also, scientific credibility requires obtaining as much evidence as possible and investigating whether the evidence matches or at least is consistent with existing theories (Ioannidis, 2012). ...
Article
Full-text available
Objective: Risk of sexual reoffending of adult men who committed sexual offenses can be understood as involving a network of causally connected dynamic risk factors. This study examined to what degree findings from previous network analyses, estimated using data from the Dynamic Supervision Project (N = 803; van den Berg et al., 2020), could be replicated. Method: Networks produced with data from the provincial corrections system of British Columbia (N = 4,511) were compared with those found in the original sample, using the Network Comparison Test (van Borkulo et al., 2019) and by correlating both the adjacency matrices of the networks and the rank of the nodes' strength centrality across networks. Results: Networks without recidivism, with sexual recidivism, and with violent recidivism (including sexual contact) differed significantly in network structure, but not in global strength. Both the adjacency matrices of the networks and the rank of the nodes' strength centrality across networks were highly correlated. The dynamic risk factors general social rejection/loneliness, lack of concern for others, poor cognitive problem-solving, and impulsive acts showed high strength centralities. In addition, all networks contained distinct communities of risk factors related to sexual self-regulation, emotionally intimate relationships, antisocial traits, and self-management. Conclusions: We successfully replicated most findings of our original study. Dynamic risk factors concerning social rejection/loneliness, cognitive problem-solving skills, impulsive behavior, and callousness appear to have a relatively strong role in the risk of sexual reoffending. Risk management and treatment strategies to reduce recidivism would benefit from a stronger focus on these dynamic risk factors.
... It is not easy to evaluate the performance of z-curve 2.0 estimates with actual data because selection bias is ubiquitous and direct replication studies are fairly rare (Zwaan et al., 2018). A notable exception is the Open Science Collaboration project that replicated 100 studies from three psychology journals (Open Science Collaboration, 2015). ...
Article
Full-text available
Selection for statistical significance is a well-known factor that distorts the published literature and challenges the cumulative progress in science. Recent replication failures have fueled concerns that many published results are false positives. Brunner and Schimmack (2020) developed z-curve, a method for estimating the expected replication rate (ERR) – the predicted success rate of exact replication studies based on the mean power after selection for significance. This article introduces an extension of this method, z-curve 2.0. The main extension is an estimate of the expected discovery rate (EDR) – the estimated proportion of all conducted statistical tests that yield statistically significant results. This information can be used to detect and quantify the amount of selection bias by comparing the EDR to the observed discovery rate (ODR; the observed proportion of statistically significant results). In addition, we examined the performance of bootstrapped confidence intervals in simulation studies. Based on these results, we created robust confidence intervals with good coverage across a wide range of scenarios to provide information about the uncertainty in EDR and ERR estimates. We implemented the method in the zcurve R package (Bartoš & Schimmack, 2020).
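The quantities estimated by z-curve 2.0 can be illustrated with a small simulation. The sketch below computes ODR, EDR, and ERR from a simulated literature in which the true effects are known; note that z-curve itself estimates EDR and ERR from the observed significant z-values alone, which this sketch does not attempt. The mixture proportions and effect size are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical literature: 60% null studies, 40% with a true effect of d = 0.4, n = 50 per group
k = 100_000
is_real = rng.random(k) < 0.4
d = np.where(is_real, 0.4, 0.0)
se = np.sqrt(2 / 50)                     # approx. SE of a two-sample standardized mean difference
z_true = d / se                          # non-centrality on the z scale
z_obs = rng.normal(z_true, 1)            # observed z statistics
sig = np.abs(z_obs) > 1.96               # selection for two-sided significance at alpha = .05

# Two-sided power of each study, computed here from the known truth
power = stats.norm.sf(1.96 - z_true) + stats.norm.cdf(-1.96 - z_true)

odr = sig.mean()                         # observed discovery rate across all conducted tests
edr = power.mean()                       # expected discovery rate: mean power before selection
err = power[sig].mean()                  # expected replication rate: mean power after selection
print(f"ODR = {odr:.2f}, EDR = {edr:.2f}, ERR = {err:.2f}")
```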
Book
Full-text available
This book is about open science practices, which have existed for roughly the last decade, have evolved through continued reflection, and, with the support of developing technology, now offer many possibilities that did not exist before. Open science and its practices are unknown to many researchers and can be an unfamiliar concept even to experienced ones. There are also researchers who know the principles of open science in theory but are hesitant about how to put them into practice, which is hardly surprising given the concept's very short history. Our aim in writing this book is to introduce open science principles and practices both theoretically and practically, and thereby to enable researchers to benefit from these practices in their own scientific research. The book is a resource for all researchers with graduate-level training and above.
Article
Full-text available
The phenomenon that contemplating future events elicits stronger emotions than contemplating past events has been termed the "temporal value asymmetry" (TVA; Caruso et al., 2008). We conducted very close replications of three experiments derived from two influential TVA papers: Studies 1 and 4 in Caruso et al. (2008), demonstrating TVA in monetary valuation, and Study 1 in Caruso (2010), demonstrating TVA in moral judgment. We also conducted a conceptual replication to test whether TVA in monetary valuation would extend to moral judgments. We failed to find support for TVA in monetary valuation (Caruso et al., 2008). We also failed to find support for TVA in moral judgments (Caruso, 2010) and in our conceptual extension. Exploratory analyses excluding potential outliers and z-transforming the dependent variable were consistent with our preregistered analyses. We discuss potential explanations for our results and future directions for research about the effects of time on judgments of value and morality.
Article
This article argues that the same epistemological assumptions cannot be confidently applied in the transition from the biological to the social arenas of psychology, as a consequence of the sociocultural instability resulting from human linguistic and technological flair. To illustrate this contention, reference is made to historicist theses within critical and sociocultural psychology, the work of Ian Hacking and Norbert Elias, the centrality of language and technology to sociocultural instability, and the illustrative issues raised by cultural neuroscience and replication studies.
Article
Full-text available
Cognitive neuroscience comes in many facets, and a particularly large branch of research is conducted in individuals with mental health problems. This article outlines why it is important that cognitive neuroscientists re-shape their role in mental health research and re-define directions of research for the next decades. At present, cognitive neuroscience research in mental health is too firmly rooted in categorical diagnostic definitions of mental health conditions. It is discussed why this hampers a mechanistic understanding of brain functions underlying mental health problems and why this is a problem for replicability in research. A possible solution to these problems is presented. This solution affects the strategy of research questions to be asked, how current trends to increase replicability in research can or cannot be applied in the mental health field, and how data are analyzed. Of note, these aspects are not only relevant for the scientific process, but affect the societal view on mental disorders and the position of affected individuals as members of society, as well as the debate on the inclusion of so-called WEIRD and non-WEIRD people in studies. Accordingly, societal and science-policy aspects of re-defining the role of cognitive neuroscientists in mental health research are elaborated that will be important to shape cognitive neuroscience in mental health for the next decades.
Chapter
Increasing evidence indicates that many published findings in psychology may be overestimated or even false. An often-heard response to this "replication crisis" is to replicate more: replication studies should weed out false positives over time and increase the robustness of psychological science. However, replications take time and money – resources that are often scarce. In this chapter, I propose an efficient alternative strategy: a four-step robustness check that first focuses on verifying reported numbers through reanalysis before replicating studies in a new sample. Keywords: Robustness of psychological research findings; Four-step robustness check; Replication crisis
Article
Psychological science constructs much of the knowledge that we consume in our everyday lives. This book is a systematic analysis of this process, and of the nature of the knowledge it produces. The authors show how mainstream scientific activity treats psychological properties as being fundamentally stable, universal, and isolable. They then challenge this status quo by inviting readers to recognize that dynamics, context-specificity, interconnectedness, and uncertainty, are a natural and exciting part of human psychology – these are not things to be avoided and feared, but instead embraced. This requires a shift toward a process-based approach that recognizes the situated, time-dependent, and fundamentally processual nature of psychological phenomena. With complex dynamic systems as a framework, this book sketches out how we might move toward a process-based praxis that is more suitable and effective for understanding human functioning.
Chapter
Full-text available
Doctoral thesis monograph - Sarahanne M. Field, 2022
Article
Full-text available
It has been almost ten years since Bem published the psi study in a prestigious social psychology journal, which ignited the replicability crisis in psychology. Since then, drastic and systematic changes in research practices have been proposed and implemented in the field. After a decade of such controversy and reformation, what is the current status of psychology? We provide an overview of the 10 years of the credibility revolution in psychology from the perspectives of "researcher degrees of freedom" and "specification space." Based on this view, we propose possible future directions for psychology to proceed as a scientific discipline.
Article
Full-text available
The open and transparent documentation of scientific processes has been established as a core antecedent of free knowledge. This also holds for generating robust insights in the scope of research projects. To convince academic peers and the public, the research process must be understandable and retraceable (reproducible), and repeatable (replicable) by others, precluding the inclusion of fluke findings into the canon of insights. In this contribution, we outline what reproducibility and replicability (R&R) could mean in the scope of different disciplines and traditions of research and what significance R&R has for generating insights in these fields. We draw on projects conducted in the scope of the Wikimedia "Open Science Fellows Program" (Fellowship Freies Wissen), an interdisciplinary, long-running funding scheme for projects contributing to open research practices. We identify twelve implemented projects from different disciplines which primarily focused on R&R, and multiple additional projects also touching on R&R. From these projects, we identify patterns and synthesize them into a roadmap of how research projects can achieve R&R across different disciplines. We further outline the ground covered by these projects and propose ways forward.
Preprint
Full-text available
In the current commentary, we want to challenge journals in the Industrial, Work, and Organizational (IWO) Psychology field and the broader Management field to (more strongly) support the use of Open Science Practices (OSPs). We believe that this challenge is necessary because - despite the strong evidence for the usefulness and effectiveness of OSPs - most pertinent journals currently still fail to support their use. Not taking action is, however, not a viable option in view of the problematic state of the pertinent literature. We present our arguments by listing and commenting on common objections against the support for OSPs that we heard in our own research with scientists and journal editors as well as in informal conversations with colleagues.
Chapter
Researchers have studied non-human primate cognition along different paths, including social cognition, planning and causal knowledge, spatial cognition and memory, and gestural communication, as well as comparative studies with humans. This volume describes how primate cognition is studied in labs, zoos, sanctuaries, and in the field, bringing together researchers examining similar issues in all of these settings and showing how each benefits from the others. Readers will discover how lab-based concepts play out in the real world of free primates. This book tackles pressing issues such as replicability, research ethics, and open science. With contributors from a broad range of comparative, cognitive, neuroscience, developmental, ecological, and ethological perspectives, the volume provides a state-of-the-art review pointing to new avenues for integrative research.
Article
Despite extensive research, the understanding of human health changes that result from long-duration spaceflight remains limited, in part due to the wide range of study types and designs, lack of independent experiment replication and data dispersal in many articles. We have compiled a database from 37 health studies that reported data for 517 parameters from missions of longer than 3-month duration on the International Space Station. We found an abundance of physiological and biochemical parameters and limited psychological/behavioral in-flight data. When we compared in-flight to pre-flight data, 14 out of 40 studied measurement type categories changed significantly, whereas only 3 categories changed significantly post-flight. Collagen breakdown biomarkers in urine showed the greatest effect, a 2-fold increase in-flight, but no data in this category were reported post-flight. Eye movements related to vestibular system function had the greatest in-flight effect that was sustained post-flight, with a decrease of 81% in-flight and a 32% decrease remaining post-flight. Analysis of the in-flight compared to post-flight biochemical and physiological changes revealed overall low correlations (R² = 0.03 & R² = 0.23, respectively), as parameters tend to return to baseline post-flight. As we look to longer duration space missions, this review provides an opportunity to identify and highlight salient results that have been reported to date for long duration spaceflight, to enhance our understanding of space health and to develop effective countermeasures. We believe that the compiled data could be explored for medical interpretation of the observed changes, in-flight timeline of changes for natural history studies, correlation analysis for parameters across different body systems, and comparison of in-flight responses to ground-based studies.
Article
Full-text available
Significance: Optical neuroimaging has become a well-established clinical and research tool to monitor cortical activations in the human brain. It is notable that outcomes of functional near-infrared spectroscopy (fNIRS) studies depend heavily on the data processing pipeline and classification model employed. Recently, deep learning (DL) methodologies have demonstrated fast and accurate performances in data processing and classification tasks across many biomedical fields. Aim: We aim to review the emerging DL applications in fNIRS studies. Approach: We first introduce some of the commonly used DL techniques. Then, the review summarizes current DL work in some of the most active areas of this field, including brain-computer interface, neuro-impairment diagnosis, and neuroscience discovery. Results: Of the 63 papers considered in this review, 32 report a comparative study of DL techniques against traditional machine learning techniques, of which 26 show DL outperforming the latter in terms of classification accuracy. In addition, eight studies also utilize DL to reduce the amount of preprocessing typically done with fNIRS data or increase the amount of data via data augmentation. Conclusions: The application of DL techniques to fNIRS studies has been shown to mitigate many of the hurdles present in fNIRS studies, such as lengthy data preprocessing or small sample sizes, while achieving comparable or improved classification accuracy.
Article
Full-text available
Self-control is of vital importance for human wellbeing. Hare et al. (2009) were among the first to provide empirical evidence on the neural correlates of self-control. This seminal study profoundly impacted theory and empirical work across multiple fields. To solidify the empirical evidence supporting self-control theory, we conducted a preregistered replication of this work. Further, we tested the robustness of the findings across analytic strategies. Participants underwent functional magnetic resonance imaging while rating 50 food items on healthiness and tastiness and making choices about food consumption. We closely replicated the original analysis pipeline and supplemented it with additional exploratory analyses to follow up on unexpected findings and to test the sensitivity of results to key analytical choices. Our replication data provide support for the notion that decisions are associated with a value signal in ventromedial prefrontal cortex (vmPFC), which integrates relevant choice attributes to inform a final decision. We found that vmPFC activity was correlated with goal values regardless of the amount of self-control, and that it correlated with both taste and health in self-controllers but only with taste in non-self-controllers. We did not find strong support for the hypothesized role of left dorsolateral prefrontal cortex (dlPFC) in self-control. The absence of statistically significant group differences in dlPFC activity during successful self-control in our sample contrasts with the notion that dlPFC involvement is required in order to effectively integrate longer-term goals into subjective value judgments. Exploratory analyses highlight the sensitivity of results (in terms of effect size) to the analytical strategy, for instance, concerning the approach to region-of-interest analysis.
Thesis
Full-text available
In this thesis I explore the extent to which researchers of animal cognition should be concerned about the reliability of the field's scientific results and the presence of theoretical biases across research programmes. To do so I apply and develop arguments borne in human psychology's "replication crisis" to animal cognition research and assess a range of secondary data analysis methods to detect bias across heterogeneous research programmes. After introducing these topics in Chapter 1, Chapter 2 makes the argument that areas of animal cognition research likely contain many findings that will struggle to replicate in direct replication studies. In Chapter 3, I combine two definitions of replication to outline the relationship between replication and theory testing, generalisability, representative sampling, and between-group comparisons in animal cognition. Chapter 4 then explores deeper issues in animal cognition research, examining how the academic systems that might select for research with low replicability might also select for theoretical bias across the research process. I use this argument to suggest that much of the vociferous methodological criticism in animal cognition research will be ineffective without considering how the academic incentive structure shapes animal cognition research. Chapter 5 then begins my attempt to develop methods to detect bias and critically and quantitatively synthesise evidence in animal cognition research. In Chapter 5, I led a team examining publication bias and the robustness of statistical inference in studies of animal physical cognition. Chapter 6 was a systematic review and a quantitative risk-of-bias assessment of the entire corvid social cognition literature. And in Chapter 7, I led a team assessing how researchers in animal cognition report and interpret non-significant statistical results, as well as the p-value distributions of non-significant results across a manually extracted dataset and an automatically extracted dataset from the animal cognition literature. Chapter 8 then reflects on the difficulties of synthesising evidence and detecting bias in animal cognition research. In Chapter 9, I present survey data of over 200 animal cognition researchers whom I questioned on the topics of this thesis. Finally, Chapter 10 summarises the findings of this thesis, and discusses potential next steps for research in animal cognition.
Article
A fundamental goal of scientific research is to generate true positives (i.e., authentic discoveries). Statistically, a true positive is a significant finding for which the underlying effect size (δ) is greater than 0, whereas a false positive is a significant finding for which δ equals 0. However, the null hypothesis of no difference (δ = 0) may never be strictly true because innumerable nuisance factors can introduce small effects for theoretically uninteresting reasons. If δ never equals zero, then with sufficient power, every experiment would yield a significant result. Yet running studies with higher power by increasing sample size (N) is one of the most widely agreed upon reforms to increase replicability. Moreover, and perhaps not surprisingly, the idea that psychology should attach greater value to small effect sizes is gaining currency. Increasing N without limit makes sense for purely measurement-focused research, where the magnitude of δ itself is of interest, but it makes less sense for theory-focused research, where the truth status of the theory under investigation is of interest. Increasing power to enhance replicability will increase true positives at the level of the effect size (statistical true positives) while increasing false positives at the level of theory (theoretical false positives). With too much power, the cumulative foundation of psychological science would consist largely of nuisance effects masquerading as theoretically important discoveries. Positive predictive value at the level of theory is maximized by using an optimal N, one that is neither too small nor too large.
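The argument that sufficient power makes every nonzero nuisance effect statistically significant can be illustrated with a short power calculation; the sketch below assumes a tiny standardized mean difference of d = 0.02 and uses a normal approximation to the two-sample test.

```python
import numpy as np
from scipy import stats

d = 0.02                                      # tiny, theoretically uninteresting "nuisance" effect
for n in (100, 1_000, 100_000, 1_000_000):    # per-group sample size
    ncp = d * np.sqrt(n / 2)                  # noncentrality of the two-sample test statistic
    power = stats.norm.sf(1.96 - ncp) + stats.norm.cdf(-1.96 - ncp)   # two-sided, alpha = .05
    print(f"n per group = {n:>9,}: power to detect d = {d} is {power:.2f}")
```

With a few hundred thousand observations per group, power for even this trivial effect approaches 1, which is the sense in which unlimited N turns nuisance effects into "statistical true positives."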
Article
Full-text available
We describe a general method that allows experimenters to quantify the evidence from the data of a direct replication attempt given data already acquired from an original study. These so-called replication Bayes factors are a reconceptualization of the ones introduced by Verhagen and Wagenmakers (Journal of Experimental Psychology: General, 143(4), 1457–1475 2014) for the common t test. This reconceptualization is computationally simpler and generalizes easily to most common experimental designs for which Bayes factors are available.
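As a rough illustration of the idea (not the authors' exact computation), the sketch below uses a normal approximation: the original study's posterior for the effect, obtained under a flat prior, serves as the prior under H1 when evaluating the replication data. The effect sizes and standard errors are hypothetical.

```python
from scipy import stats

def replication_bf(d_orig, se_orig, d_rep, se_rep):
    """Normal-approximation sketch of a replication Bayes factor (BF_r0)."""
    # H1 ("the original effect is real"): replication estimate ~ N(d_orig, se_orig^2 + se_rep^2)
    m1 = stats.norm.pdf(d_rep, loc=d_orig, scale=(se_orig**2 + se_rep**2) ** 0.5)
    # H0 ("no effect"): replication estimate ~ N(0, se_rep^2)
    m0 = stats.norm.pdf(d_rep, loc=0.0, scale=se_rep)
    return m1 / m0

# Hypothetical numbers: original d = 0.60 (SE 0.20), replication d = 0.15 (SE 0.10)
print(replication_bf(0.60, 0.20, 0.15, 0.10))   # values below 1 favour H0 over the original's estimate
```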
Preprint
Full-text available
Twenty-nine teams involving 61 analysts used the same dataset to address the same research question: whether soccer referees are more likely to give red cards to dark skin toned players than light skin toned players. Analytic approaches varied widely across teams, and estimated effect sizes ranged from 0.89 to 2.93 in odds ratio units, with a median of 1.31. Twenty teams (69%) found a statistically significant positive effect and nine teams (31%) observed a non-significant relationship. Overall 29 different analyses used 21 unique combinations of covariates. We found that neither analysts' prior beliefs about the effect, nor their level of expertise, nor peer-reviewed quality of analysis readily explained variation in analysis outcomes. This suggests that significant variation in analysis of complex data may be difficult to avoid, even by experts with honest intentions. Crowdsourcing data analysis, a strategy by which numerous research teams are recruited to simultaneously investigate the same research question, makes transparent how defensible, yet subjective analytic choices influence research results. Currently available at: https://psyarxiv.com/qkwst/
Article
Full-text available
Finkel, Eastwick, and Reis (2016; FER2016) argued that the post-2011 methodological reform movement has focused narrowly on replicability, neglecting other essential goals of research. We agree that multiple scientific goals are essential, but argue that a more fine-grained language, conceptualization, and approach to replication is needed to accomplish these goals. Replication is the general empirical mechanism for testing and falsifying theory. Sufficiently methodologically similar replications, also known as direct replications, test the basic existence of phenomena and ensure that cumulative progress is possible a priori. In contrast, increasingly methodologically dissimilar replications, also known as conceptual replications, test the relevance of auxiliary hypotheses (e.g., manipulation and measurement issues, contextual factors) required to productively investigate validity and generalizability. Without prioritizing replicability, a field is not empirically falsifiable. We also disagree with FER2016's position that "bigger samples are generally better, but that very large samples could have the downside of commandeering resources that would have been better invested in other studies" (abstract). We identify problematic assumptions involved in FER2016's modifications of our original research-economic model, and present an improved model that quantifies when (and whether) it is reasonable to worry that increasing statistical power will engender potential trade-offs. Sufficiently powering studies (i.e., >80%) maximizes both research efficiency and confidence in the literature (research quality). Given that we are in agreement with FER2016 on all key open science points, we are eager to start seeing the accelerated rate of cumulative knowledge development of social psychological phenomena that such a sufficiently transparent, powered, and falsifiable approach will generate.
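For reference, the per-group sample sizes implied by the >80% power recommendation can be obtained with a standard power analysis; the sketch below assumes a two-sided independent-samples t test and a medium-small effect of d = 0.40 (an illustrative value, not one taken from the paper).

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Per-group n needed to detect d = 0.40 at alpha = .05 (two-sided) for two power targets
for target in (0.80, 0.95):
    n = analysis.solve_power(effect_size=0.40, alpha=0.05, power=target,
                             alternative='two-sided')
    print(f"power = {target:.0%}: n per group ≈ {n:.0f}")
```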
Article
Full-text available
Many argue that there is a reproducibility crisis in psychology. We investigated nine well-known effects from the cognitive psychology literature—three each from the domains of perception/action, memory, and language, respectively—and found that they are highly reproducible. Not only can they be reproduced in online environments, but they also can be reproduced with nonnaïve participants with no reduction of effect size. Apparently, some cognitive tasks are so constraining that they encapsulate behavior from external influences, such as testing situation and prior recent experience with the experiment to yield highly robust effects.
Article
Full-text available
The vast majority of published results in the literature are statistically significant, which raises concerns about their reliability. The Reproducibility Project: Psychology (RPP) and the Experimental Economics Replication Project (EE-RP) both replicated a large number of published studies in psychology and economics. Both the original study and its replication were statistically significant in 36.1% of cases in the RPP and 68.8% in the EE-RP, suggesting many null effects among the replicated studies. However, evidence in favor of the null hypothesis cannot be examined with null hypothesis significance testing. We developed a Bayesian meta-analysis method called snapshot hybrid that is easy to use and understand and quantifies the amount of evidence in favor of a zero, small, medium, and large effect. The method computes posterior model probabilities for a zero, small, medium, and large effect and adjusts for publication bias by taking into account that the original study is statistically significant. We first analytically approximate the method's performance and demonstrate the necessity of controlling for the original study's significance to enable the accumulation of evidence for a true zero effect. We then applied the method to the data of the RPP and EE-RP, showing that the underlying effect sizes of the studies included in the EE-RP are generally larger than in the RPP, but that the sample sizes, especially of the studies included in the RPP, are often too small to draw definite conclusions about the true effect size. We also illustrate how snapshot hybrid can be used to determine the required sample size of a replication, akin to power analysis in null hypothesis significance testing, and present an easy-to-use web application (https://rvanaert.shinyapps.io/snapshot/) and R code for applying the method.
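The core idea of correcting the original study's likelihood for selection on statistical significance can be sketched with normal approximations, as below. This is a simplified illustration with hypothetical numbers, not the snapshot hybrid implementation itself (which works with correlation-scale effect sizes and is available via the web application cited above).

```python
import numpy as np
from scipy import stats

def snapshot_style_probs(d_orig, se_orig, d_rep, se_rep, effects=(0.0, 0.2, 0.5, 0.8)):
    """Posterior probabilities of a zero, small, medium, or large true effect,
    adjusting the original study's likelihood for selection on two-sided significance."""
    crit = stats.norm.ppf(0.975) * se_orig         # critical value on the estimate's scale
    weights = []
    for delta in effects:
        # Probability that a study with true effect delta would have been significant
        p_sig = (stats.norm.sf(crit, loc=delta, scale=se_orig)
                 + stats.norm.cdf(-crit, loc=delta, scale=se_orig))
        # Density of the original estimate, conditional on it having been significant
        l_orig = stats.norm.pdf(d_orig, loc=delta, scale=se_orig) / p_sig
        # The replication was not selected for significance, so no correction is needed
        l_rep = stats.norm.pdf(d_rep, loc=delta, scale=se_rep)
        weights.append(l_orig * l_rep)              # equal prior model probabilities assumed
    post = np.array(weights) / np.sum(weights)
    return dict(zip(effects, post.round(3)))

# Hypothetical: original d = 0.55 (SE 0.22, significant), replication d = 0.05 (SE 0.12)
print(snapshot_style_probs(0.55, 0.22, 0.05, 0.12))
```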
Article
Full-text available
Finkel, Rusbult, Kumashiro, and Hannon (2002, Study 1) demonstrated a causal link between subjective commitment to a relationship and how people responded to hypothetical betrayals of that relationship. Participants primed to think about their commitment to their partner (high commitment) reacted to the betrayals with reduced exit and neglect responses relative to those primed to think about their independence from their partner (low commitment). The priming manipulation did not affect constructive voice and loyalty responses. Although other studies have demonstrated a correlation between subjective commitment and responses to betrayal, this study provides the only experimental evidence that inducing changes to subjective commitment can causally affect forgiveness responses. This Registered Replication Report (RRR) meta-analytically combines the results of 16 new direct replications of the original study, all of which followed a standardized, vetted, and preregistered protocol. The results showed little effect of the priming manipulation on the forgiveness outcome measures, but they also showed no effect of priming on subjective commitment, so the manipulation did not work as it had in the original study. We discuss possible explanations for the discrepancy between the findings from this RRR and the original study.
Article
Full-text available
According to the facial feedback hypothesis, people’s affective responses can be influenced by their own facial expression (e.g., smiling, pouting), even when their expression did not result from their emotional experiences. For example, Strack, Martin, and Stepper (1988) instructed participants to rate the funniness of cartoons using a pen that they held in their mouth. In line with the facial feedback hypothesis, when participants held the pen with their teeth (inducing a “smile”), they rated the cartoons as funnier than when they held the pen with their lips (inducing a “pout”). This seminal study of the facial feedback hypothesis has not been replicated directly. This Registered Replication Report describes the results of 17 independent direct replications of Study 1 from Strack et al. (1988), all of which followed the same vetted protocol. A meta-analysis of these studies examined the difference in funniness ratings between the “smile” and “pout” conditions. The original Strack et al. (1988) study reported a rating difference of 0.82 units on a 10-point Likert scale. Our meta-analysis revealed a rating difference of 0.03 units with a 95% confidence interval ranging from −0.11 to 0.16.
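The pooling step used in Registered Replication Reports can be illustrated with a minimal inverse-variance (fixed-effect) meta-analysis; the published reports use more elaborate random-effects models, and the lab-level numbers below are made up.

```python
import numpy as np

def fixed_effect_meta(estimates, ses):
    """Minimal inverse-variance (fixed-effect) meta-analysis of per-lab effect estimates."""
    est, se = np.asarray(estimates, float), np.asarray(ses, float)
    w = 1.0 / se**2                       # inverse-variance weights
    pooled = np.sum(w * est) / np.sum(w)
    pooled_se = np.sqrt(1.0 / np.sum(w))
    ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    return pooled, ci

# Hypothetical rating differences ("smile" minus "pout") and standard errors from five labs
labs = [0.10, -0.05, 0.02, 0.08, -0.01]
ses  = [0.12,  0.10, 0.15, 0.11,  0.13]
print(fixed_effect_meta(labs, ses))       # pooled difference and its 95% confidence interval
```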
Article
Full-text available
There have been frequent expressions of concern over the supposed failure of researchers to conduct replication studies. But the large number of meta-analyses in our literatures shows that replication studies are in fact being conducted in most areas of research. Many who argue for replication as the “gold standard” consider a nonsignificant replication attempt to be strong evidence against the initial study, an interpretation that ignores statistical power, typically low in behavioral research. Many researchers also hold that there is no need to replicate a nonsignificant finding, believing it will always replicate perfectly, an erroneous belief. These beliefs lead to a widely accepted sequential model of the research process that is deficient because it assumes that a single study can answer a research question, a belief that meta-analysis has shown to be false. Meta-analysis can provide the solution to these problems if the problems of publication bias and questionable research practices are successfully addressed. The real problem is not a lack of replication; it is the distortion of our research literatures caused by publication bias and questionable research practices.
Article
Full-text available
Replication initiatives in psychology continue to gather considerable attention from far outside the field, as well as controversy from within. Some accomplishments of these initiatives are noted, but this article focuses on why they do not provide a general solution for what ails psychology. There are inherent limitations to mass replications ever being conducted in many areas of psychology, both in terms of their practicality and their prospects for improving the science. Unnecessary compromises were built into the ground rules for design and publication of the Open Science Collaboration: Psychology that undermine its effectiveness. Some ground rules could actually be flipped into guidance for how not to conduct replications. Greater adherence to best publication practices, transparency in the design and publishing of research, strengthening of independent post-publication peer review and firmer enforcement of rules about data sharing and declarations of conflict of interest would make many replications unnecessary. Yet, it has been difficult to move beyond simple endorsement of these measures to consistent implementation. Given the strong institutional support for questionable publication practices, progress will depend on effective individual and collective use of social media to expose lapses and demand reform. Some recent incidents highlight the necessity of this.
Article
Full-text available
Poor research design and data analysis encourage false-positive findings. Such poor methods persist despite perennial calls for improvement, suggesting that they result from something more than just misunderstanding. The persistence of poor methods results partly from incentives that favor them, leading to the natural selection of bad science. This dynamic requires no conscious strategizing---no deliberate cheating nor loafing---by scientists, only that publication is a principal factor in career advancement. Some normative methods of analysis have almost certainly been selected to further publication instead of discovery. In order to improve the culture of science, a shift must be made away from correcting misunderstandings and towards rewarding understanding. We support this argument with empirical evidence and computational modeling. We first present a 60-year meta-analysis of statistical power in the behavioral sciences and show that power has not improved despite repeated demonstrations of the necessity of increasing power. To demonstrate the logical consequences of structural incentives, we then present a dynamic model of scientific communities in which competing laboratories investigate novel or previously published hypotheses using culturally transmitted research methods. As in the real world, successful labs produce more "progeny", such that their methods are more often copied and their students are more likely to start labs of their own. Selection for high output leads to poorer methods and increasingly high false discovery rates. We additionally show that replication slows but does not stop the process of methodological deterioration. Improving the quality of research requires change at the institutional level.
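A static piece of this argument, namely that persistently low power inflates the rate of false discoveries in the published record, can be shown with simple arithmetic; the sketch below is not the authors' dynamic model of competing labs, and the base rate of true hypotheses is an arbitrary assumption.

```python
# False discovery rate implied by the significance level, statistical power,
# and the base rate of true hypotheses among those tested.
def fdr(alpha=0.05, power=0.8, base_rate=0.1):
    false_pos = alpha * (1 - base_rate)   # expected share of tests that are false positives
    true_pos = power * base_rate          # expected share of tests that are true positives
    return false_pos / (false_pos + true_pos)

for pw in (0.2, 0.5, 0.8):
    print(f"power = {pw:.1f}: false discovery rate = {fdr(power=pw):.2f}")
```

Holding the base rate fixed at 10%, dropping power from 0.8 to 0.2 roughly doubles the false discovery rate, which is why selection for sheer output that keeps power low degrades the literature.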
Article
Full-text available
Significance Scientific progress requires that findings can be reproduced by other scientists. However, there is widespread debate in psychology (and other fields) about how to interpret failed replications. Many have argued that contextual factors might account for several of these failed replications. We analyzed 100 replication attempts in psychology and found that the extent to which the research topic was likely to be contextually sensitive (varying in time, culture, or location) was associated with replication success. This relationship remained a significant predictor of replication success even after adjusting for characteristics of the original and replication studies that previously had been associated with replication success (e.g., effect size, statistical power). We offer recommendations for psychologists and other scientists interested in reproducibility.
Article
Full-text available
Replication is vital for increasing precision and accuracy of scientific claims. However, when replications “succeed” or “fail,” they could have reputational consequences for the claim’s originators. Surveys of United States adults (N = 4,786), undergraduates (N = 428), and researchers (N = 313) showed that reputational assessments of scientists were based more on how they pursue knowledge and respond to replication evidence, not whether the initial results were true. When comparing one scientist that produced boring but certain results with another that produced exciting but uncertain results, opinion favored the former despite researchers’ belief in more rewards for the latter. Considering idealized views of scientific practices offers an opportunity to address incentives to reward both innovation and verification.
Article
Full-text available
Good self-control has been linked to adaptive outcomes such as better health, cohesive personal relationships, success in the workplace and at school, and less susceptibility to crime and addictions. In contrast, self-control failure is linked to maladaptive outcomes. Understanding the mechanisms by which self-control predicts behavior may assist in promoting better regulation and outcomes. A popular approach to understanding self-control is the strength or resource depletion model. Self-control is conceptualized as a limited resource that becomes depleted after a period of exertion resulting in self-control failure. The model has typically been tested using a sequential-task experimental paradigm, in which people completing an initial self-control task have reduced self-control capacity and poorer performance on a subsequent task, a state known as ego depletion. Although a meta-analysis of ego-depletion experiments found a medium-sized effect, subsequent meta-analyses have questioned the size and existence of the effect and identified instances of possible bias. The analyses served as a catalyst for the current Registered Replication Report of the ego-depletion effect. Multiple laboratories (k = 23, total N = 2,141) conducted replications of a standardized ego-depletion protocol based on a sequential-task paradigm by Sripada et al. Meta-analysis of the studies revealed that the size of the ego-depletion effect was small with 95% confidence intervals (CIs) that encompassed zero (d = 0.04, 95% CI [−0.07, 0.15]). We discuss implications of the findings for the ego-depletion effect and the resource depletion model of self-control.
Article
Full-text available
There is considerable current debate about the need for replication in the science of social psychology. Most of the current discussion and approbation is centered on direct or exact replications, the attempt to conduct a study in a manner as close to the original as possible. We focus on the value of conceptual replications, the attempt to test the same theoretical process as an existing study, but that uses methods that vary in some way from the previous study. The tension between the two kinds of replication is a tension of values—exact replications value confidence in operationalizations; their requirement tends to favor the status quo. Conceptual replications value confidence in theory; their use tends to favor rapid progress over ferreting out error. We describe the many ways in which conceptual replications can be superior to direct replications. We further argue that the social system of science is quite robust to these threats and is self-correcting.
Article
Full-text available
In contrast to the truncated view that replications have only a little to offer beyond what is already known, we suggest a broader understanding of replications: We argue that replications are better conceptualized as a process of conducting consecutive studies that increasingly consider alternative explanations, critical contingencies, and real-world relevance. To reflect this understanding, we collected and summarized the existing literature on replications and combined it into a comprehensive overall typology that simplifies and restructures existing approaches. The resulting typology depicts how multiple, hierarchically structured replication studies guide the integration of laboratory and field research and advance theory. It can be applied to (a) evaluate a theory's current status, (b) guide researchers' decisions, (c) analyze and argue for the necessity of certain types of replication studies, and (d) assess the added value of a replication study at a given state of knowledge. We conclude with practical recommendations for different protagonists in the field (e.g., authors, reviewers, editors, and funding agencies). Together, our comprehensive typology and the related recommendations will contribute to an enhanced replication culture in social psychology and to a stronger real-world impact of the discipline.
Article
Full-text available
Recently, many psychological effects have been surprisingly difficult to reproduce. This article asks why, and investigates whether conceptually replicating an effect in the original publication is related to the success of independent, direct replications. Two prominent accounts of low reproducibility make different predictions in this respect. One account suggests that psychological phenomena are dependent on unknown contexts that are not reproduced in independent replication attempts. By this account, internal replications indicate that a finding is more robust and, thus, that it is easier to independently replicate it. An alternative account suggests that researchers employ questionable research practices (QRPs), which increase false positive rates. By this account, the success of internal replications may just be the result of QRPs and, thus, internal replications are not predictive of independent replication success. The data of a large reproducibility project support the QRP account: replicating an effect in the original publication is not related to independent replication success. Additional analyses reveal that internally replicated and internally unreplicated effects are not very different in terms of variables associated with replication success. Moreover, social psychological effects in particular appear to lack any benefit from internal replications. Overall, these results indicate that, in this dataset at least, the influence of QRPs is at the heart of failures to replicate psychological findings, especially in social psychology. Variable, unknown contexts appear to play only a relatively minor role. I recommend practical solutions for how QRPs can be avoided. Electronic supplementary material The online version of this article (doi:10.3758/s13423-016-1030-9) contains supplementary material, which is available to authorized users.
Article
Full-text available
Gilbert et al. conclude that evidence from the Open Science Collaboration’s Reproducibility Project: Psychology indicates high reproducibility, given the study methodology. Their very optimistic assessment is limited by statistical misconceptions and by causal inferences from selectively interpreted, correlational data. Using the Reproducibility Project: Psychology data, both optimistic and pessimistic conclusions about reproducibility are possible, and neither are yet warranted.
Article
Full-text available
We revisit the results of the recent Reproducibility Project: Psychology by the Open Science Collaboration. We compute Bayes factors, a quantity that can be used to express comparative evidence for a hypothesis but also for the null hypothesis, for a large subset (N = 72) of the original papers and their corresponding replication attempts. In our computation, we take into account the likely scenario that publication bias had distorted the originally published results. Overall, 75% of studies gave qualitatively similar results in terms of the amount of evidence provided. However, the evidence was often weak (i.e., Bayes factor < 10). The majority of the studies (64%) did not provide strong evidence for either the null or the alternative hypothesis in either the original or the replication, and no replication attempts provided strong evidence in favor of the null. In all cases where the original paper provided strong evidence but the replication did not (15%), the sample size in the replication was smaller than the original. Where the replication provided strong evidence but the original did not (10%), the replication sample size was larger. We conclude that the apparent failure of the Reproducibility Project to replicate many target effects can be adequately explained by overestimation of effect sizes (or overestimation of evidence against the null hypothesis) due to small sample sizes and publication bias in the psychological literature. We further conclude that traditional sample sizes are insufficient and that a more widespread adoption of Bayesian methods is desirable.
Article
Full-text available
Language can be viewed as a complex set of cues that shape people’s mental representations of situations. For example, people think of behavior described using imperfective aspect (i.e., what a person was doing) as a dynamic, unfolding sequence of actions, whereas the same behavior described using perfective aspect (i.e., what a person did) is perceived as a completed whole. A recent study found that aspect can also influence how we think about a person’s intentions (Hart & Albarracín, 2011). Participants judged actions described in imperfective as being more intentional (d between 0.67 and 0.77) and they imagined these actions in more detail (d = 0.73). The fact that this finding has implications for legal decision making, coupled with the absence of other direct replication attempts, motivated this registered replication report (RRR). Multiple laboratories carried out 12 direct replication studies, including one MTurk study. A meta-analysis of these studies provides a precise estimate of the size of this effect free from publication bias. This RRR did not find that grammatical aspect affects intentionality (d between 0 and −0.24) or imagery (d = −0.08). We discuss possible explanations for the discrepancy between these results and those of the original study.
Article
Full-text available
Scientists are dedicating more attention to replication efforts. While the scientific utility of replications is unquestionable, the impact of failed replication efforts and the discussions surrounding them deserve more attention. Specifically, the debates about failed replications on social media have led to worry, in some scientists, regarding reputation. In order to gain data-informed insights into these issues, we collected data from 281 published scientists. We assessed whether scientists overestimate the negative reputational effects of a failed replication in a scenario-based study. Second, we assessed the reputational consequences of admitting wrongness (versus not) as an original scientist of an effect that has failed to replicate. Our data suggest that scientists overestimate the negative reputational impact of a hypothetical failed replication effort. We also show that admitting wrongness about a non-replicated finding is less harmful to one's reputation than not admitting. Finally, we discovered a hint of evidence that feelings about the replication movement can be affected by whether replication efforts are aimed at one's own work versus the work of another. Given these findings, we then present potential ways forward in these discussions.
Article
Full-text available
The data includes measures collected for the two experiments reported in “False-Positive Psychology” [1] where listening to a randomly assigned song made people feel younger (Study 1) or actually be younger (Study 2). These data are useful because they illustrate inflations of false positive rates due to flexibility in data collection, analysis, and reporting of results. Data are useful for educational purposes.
Article
Full-text available
This article brings attention to some historical developments that gave rise to the Bayes factor for testing a point null hypothesis against a composite alternative. In line with current thinking, we find that the conceptual innovation - to assign prior mass to a general law - is due to a series of three articles by Dorothy Wrinch and Sir Harold Jeffreys (1919, 1921, 1923). However, our historical investigation also suggests that in 1932 J.B.S. Haldane made an important contribution to the development of the Bayes factor by proposing the use of a mixture prior comprising a point mass and a continuous probability density. Jeffreys was aware of Haldane's work and it may have inspired him to pursue a more concrete statistical implementation for his conceptual ideas. It thus appears that Haldane may have played a much bigger role in the statistical development of the Bayes factor than has hitherto been assumed.
Article
Full-text available
Crisis of replicability is one term that psychological scientists use for the current introspective phase we are in; I argue instead that we are going through a revolution analogous to a political revolution. Revolution 2.0 is an uprising focused on how we should be doing science now (i.e., in a 2.0 world). The precipitating events of the revolution have already been well-documented: failures to replicate, questionable research practices, fraud, etc. And the fact that none of these events is new to our field has also been well-documented. I suggest four interconnected reasons as to why this time is different: changing technology, changing demographics of researchers, limited resources, and misaligned incentives. I then describe two reasons why the revolution is more likely to catch on this time: technology (as part of the solution) and the fact that these concerns cut across social and life sciences-that is, we are not alone. Neither side in the revolution has behaved well, and each has characterized the other in extreme terms (although, of course, each has had a few extreme actors). Some suggested reforms are already taking hold (e.g., journals asking for more transparency in methods and analysis decisions; journals publishing replications) but the feared tyrannical requirements have, of course, not taken root (e.g., few journals require open data; there is no ban on exploratory analyses). Still, we have not yet made needed advances in the ways in which we accumulate, connect, and extract conclusions from our aggregated research. However, we are now ready to move forward by adopting incremental changes and by acknowledging the multiplicity of goals within psychological science.
Article
Full-text available
Harold Jeffreys pioneered the development of default Bayes factor hypothesis tests for standard statistical problems. Using Jeffreys's Bayes factor hypothesis tests, researchers can grade the decisiveness of the evidence that the data provide for a point null hypothesis H0 versus a composite alternative hypothesis H1. Consequently, Jeffreys's tests are of considerable theoretical and practical relevance for empirical researchers in general and for experimental psychologists in particular. To highlight this relevance and to facilitate the interpretation and use of Jeffreys's Bayes factor tests, we focus on two common inferential scenarios: testing the nullity of a normal mean (i.e., the Bayesian equivalent of the t test) and testing the nullity of a correlation. For both Bayes factor tests, we explain their development, we extend them to one-sided problems, and we apply them to concrete examples from experimental psychology.
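For the t-test case, the default (JZS) Bayes factor can be computed by one-dimensional numerical integration. The sketch below follows the standard integral form for a one-sample t statistic with a Cauchy prior of scale r on the effect size; it is a minimal illustration rather than the authors' code, and the example inputs are hypothetical.

```python
import numpy as np
from scipy import integrate

def jzs_bf10(t, n, r=np.sqrt(2) / 2):
    """Default (JZS) Bayes factor BF10 for a one-sample t statistic with n observations.
    The effect size has a Cauchy(0, r) prior, equivalently g ~ inverse-gamma(1/2, r^2/2)."""
    nu = n - 1
    def integrand(g):
        prior_g = r / np.sqrt(2 * np.pi) * g**(-1.5) * np.exp(-r**2 / (2 * g))
        marginal_lik = (1 + n * g)**(-0.5) * (1 + t**2 / ((1 + n * g) * nu))**(-(nu + 1) / 2)
        return marginal_lik * prior_g
    numerator, _ = integrate.quad(integrand, 0, np.inf)       # marginal likelihood under H1
    denominator = (1 + t**2 / nu)**(-(nu + 1) / 2)            # likelihood under H0 (delta = 0)
    return numerator / denominator

# Hypothetical result: t = 2.2 with n = 30 observations
print(jzs_bf10(t=2.2, n=30))   # BF10 > 1 favours H1, BF10 < 1 favours H0
```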
Article
Full-text available
This crowdsourced project introduces a collaborative approach to improving the reproducibility of scientific research, in which findings are replicated in qualified independent laboratories before (rather than after) they are published. Our goal is to establish a non-adversarial replication process with highly informative final results. To illustrate the Pre-Publication Independent Replication (PPIR) approach, 25 research groups conducted replications of all ten moral judgment effects which the last author and his collaborators had "in the pipeline" as of August 2014. Six findings replicated according to all replication criteria, one finding replicated but with a significantly smaller effect size than the original, one finding replicated consistently in the original culture but not outside of it, and two findings failed to find support. In total, 40% of the original findings failed at least one major replication criterion. Potential ways to implement and incentivize pre-publication independent replication on a large scale are discussed.
Article
Full-text available
Many recent discussions have focused on the role of replication in psychological science. In this article, we examine three key issues in evaluating the conclusions that follow from results of studies at least partly aimed at replicating previous results: the evaluation and status of exact versus conceptual replications, the statistical evaluation of replications, and the robustness of research findings to potential existing or future “non-replications.” In the first section of the article, we discuss the sources of ambiguity in evaluating failures to replicate in exact as well as conceptual replications. In addressing these ambiguities, we emphasize the key role of psychometric invariance of the independent and dependent variables in evaluations of replications. In the second section of the article, we use a meta-analytic framework to discuss the statistical status of replication attempts. We emphasize meta-analytic tools that have been used too sparingly, especially in evaluation of sets of studies within a single article or focused program of research. In the final section of the article, we extend many of these meta-analytic tools to the evaluation of the robustness of a body of research to potential existing or future failures to replicate previous statistically significant results.
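As an illustration of the kind of meta-analytic tool this abstract refers to, the sketch below pools an original study and several replications using fixed-effect inverse-variance weighting; the effect sizes and variances are hypothetical numbers, not data from any of the studies discussed here.

```python
# Sketch of fixed-effect (inverse-variance) pooling of an original study and
# its replications. All numbers below are hypothetical.
import numpy as np
from scipy import stats

effects = np.array([0.45, 0.12, 0.08, 0.20])    # original + three replications (e.g., Cohen's d)
variances = np.array([0.040, 0.015, 0.012, 0.018])

weights = 1.0 / variances
pooled = np.sum(weights * effects) / np.sum(weights)
se_pooled = np.sqrt(1.0 / np.sum(weights))
z = pooled / se_pooled
p = 2 * stats.norm.sf(abs(z))

print(f"pooled d = {pooled:.3f}, "
      f"95% CI = [{pooled - 1.96 * se_pooled:.3f}, {pooled + 1.96 * se_pooled:.3f}], "
      f"p = {p:.4f}")
```

Evaluating the pooled estimate and its confidence interval, rather than tallying which individual studies crossed p < .05, is the spirit of the meta-analytic evaluation the authors recommend.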
Article
Full-text available
Psychology has recently been viewed as facing a replication crisis because efforts to replicate past study findings frequently do not show the same result. Often, the first study showed a statistically significant result but the replication does not. Questions then arise about whether the first study results were false positives, and whether the replication study correctly indicates that there is truly no effect after all. This article suggests these so-called failures to replicate may not be failures at all, but rather are the result of low statistical power in single replication studies, and the result of failure to appreciate the need for multiple replications in order to have enough power to identify true effects. We provide examples of these power problems and suggest some solutions using Bayesian statistics and meta-analysis. Although the need for multiple replication studies may frustrate those who would prefer quick answers to psychology's alleged crisis, the large sample sizes typically needed to provide firm evidence will almost always require concerted efforts from multiple investigators. As a result, it remains to be seen how many of the recently claimed failures to replicate will be supported or instead may turn out to be artifacts of inadequate sample sizes and single study replications.
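The power problem described here is easy to quantify. The sketch below, assuming a hypothetical true effect of d = 0.3, shows how a single two-group replication with 50 participants per group is badly underpowered, while the pooled sample from several such replications is not; the numbers are illustrative, not taken from the article.

```python
# Sketch: power of a single two-sample replication versus several pooled
# replications, assuming a hypothetical true effect of d = 0.3.
import numpy as np
from scipy import stats

def power_two_sample_t(d, n_per_group, alpha=0.05):
    """Two-sided power of an independent-samples t-test for true effect d."""
    df = 2 * n_per_group - 2
    nc = d * np.sqrt(n_per_group / 2)          # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return stats.nct.sf(t_crit, df, nc) + stats.nct.cdf(-t_crit, df, nc)

d_true = 0.3
print(power_two_sample_t(d_true, 50))    # one replication, n = 50 per group: ~0.32
print(power_two_sample_t(d_true, 250))   # five such replications pooled: ~0.92
```

A single "failed" replication in the first scenario is therefore weak evidence against the effect, which is exactly the point the authors make about needing multiple, aggregated replications.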
Article
Full-text available
We present a suite of Bayes factor hypothesis tests that allow researchers to grade the decisiveness of the evidence that the data provide for the presence versus the absence of a correlation between two variables. For concreteness, we apply our methods to the recent work of Donnellan et al. (in press), who conducted nine replication studies with over 3,000 participants and failed to replicate the phenomenon that lonely people compensate for a lack of social warmth by taking warmer baths or showers. We show how the Bayes factor hypothesis test can quantify evidence in favor of the null hypothesis, and how the prior specification for the correlation coefficient can be used to define a broad range of tests that address complementary questions. Specifically, we show how the prior specification can be adjusted to create a two-sided test, a one-sided test, a sensitivity analysis, and a replication test.
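For readers who want to see the mechanics, here is a rough sketch of such a correlation Bayes factor. It relies on Jeffreys's approximation to the likelihood of an observed r given rho and a uniform prior on rho under H1; the observed values, the prior choice, and the function name are illustrative assumptions rather than the authors' code.

```python
# Sketch of a Jeffreys-style Bayes factor for a Pearson correlation, using
# Jeffreys's approximate likelihood of r given rho (an approximation, not the
# exact sampling density) and a uniform prior on rho under H1.
import numpy as np
from scipy import integrate

def bf10_correlation(r, n, one_sided=False):
    """Approximate BF10 for H1: rho != 0 (or rho > 0) versus H0: rho = 0."""
    def kernel(rho):
        # Jeffreys's approximate reduced likelihood of the observed r given rho.
        return (1 - rho**2) ** ((n - 1) / 2) * (1 - rho * r) ** ((3 - 2 * n) / 2)
    if one_sided:
        # Uniform prior on (0, 1): density 1 on that interval.
        m1, _ = integrate.quad(kernel, 0, 1)
    else:
        # Uniform prior on (-1, 1): density 1/2.
        m1, _ = integrate.quad(lambda rho: 0.5 * kernel(rho), -1, 1)
    return m1 / kernel(0.0)

# Hypothetical replication result: r = -0.02 with n = 300.
print(bf10_correlation(-0.02, 300))         # BF10 < 1: evidence favours the null
print(bf10_correlation(-0.02, 300, True))   # one-sided test for rho > 0
```

Because the Bayes factor is a ratio, values below 1 quantify evidence for the null hypothesis, which is how this kind of test can say more about a failed replication than a nonsignificant p value can.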
Article
Full-text available
In 2012, the American Political Science Association (APSA) Council adopted new policies guiding data access and research transparency in political science. The policies appear as a revision to APSA's Guide to Professional Ethics in Political Science. The revisions were the product of an extended and broad consultation with a variety of APSA committees and the association's membership.
Chapter
Two books have been particularly influential in contemporary philosophy of science: Karl R. Popper's Logic of Scientific Discovery, and Thomas S. Kuhn's Structure of Scientific Revolutions. Both agree upon the importance of revolutions in science, but differ about the role of criticism in science's revolutionary growth. This volume arose out of a symposium on Kuhn's work, with Popper in the chair, at an international colloquium held in London in 1965. The book begins with Kuhn's statement of his position followed by seven essays offering criticism and analysis, and finally by Kuhn's reply. The book will interest senior undergraduates and graduate students of the philosophy and history of science, as well as professional philosophers, philosophically inclined scientists, and some psychologists and sociologists.
Article
We studied publication bias in the social sciences by analyzing a known population of conducted studies—221 in total—in which there is a full accounting of what is published and unpublished. We leveraged Time-sharing Experiments in the Social Sciences (TESS), a National Science Foundation–sponsored program in which researchers propose survey-based experiments to be run on representative samples of American adults. Because TESS proposals undergo rigorous peer review, the studies in the sample all exceed a substantial quality threshold. Strong results are 40 percentage points more likely to be published than are null results and 60 percentage points more likely to be written up. We provide direct evidence of publication bias and identify the stage of research production at which publication bias occurs: Authors do not write up and submit null findings.
Article
The first results from the Reproducibility Project: Cancer Biology suggest that there is scope for improving reproducibility in pre-clinical cancer research.
Article
Psychological scientists draw inferences about populations based on samples—of people, situations, and stimuli—from those populations. Yet, few papers identify their target populations, and even fewer justify how or why the tested samples are representative of broader populations. A cumulative science depends on accurately characterizing the generality of findings, but current publishing standards do not require authors to constrain their inferences, leaving readers to assume the broadest possible generalizations. We propose that the discussion section of all primary research articles specify Constraints on Generality (i.e., a “COG” statement) that identify and justify target populations for the reported findings. Explicitly defining the target populations will help other researchers to sample from the same populations when conducting a direct replication, and it could encourage follow-up studies that test the boundary conditions of the original finding. Universal adoption of COG statements would change publishing incentives to favor a more cumulative science.