False-Positive Psychology

Abstract and Figures

In this article, we accomplish two things. First, we show that despite empirical psychologists' nominal endorsement of a low rate of false-positive findings (≤ .05), flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates. In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not. We present computer simulations and a pair of actual experiments that demonstrate how unacceptably easy it is to accumulate (and report) statistically significant evidence for a false hypothesis. Second, we suggest a simple, low-cost, and straightforwardly effective disclosure-based solution to this problem. The solution involves six concrete requirements for authors and four guidelines for reviewers, all of which impose a minimal burden on the publication process.
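The kind of simulation the abstract refers to can be sketched in a few lines. The example below is an illustration, not the authors' code: it assumes two groups of 20 drawn from the same population (so the null hypothesis is true), two dependent measures correlated at r = .5, and a z-test with known variance in place of a t-test. The names `simulate`, `correlated_pair`, and `two_sided_p` are hypothetical. It compares the false-positive rate when a single measure is tested against the rate when a researcher reports whichever of the two measures reaches p < .05.

```python
import random
from statistics import NormalDist, mean


def two_sided_p(a, b, sd=1.0):
    """Two-sided z-test p-value for a difference in means (assumes known sd)."""
    se = sd * (1 / len(a) + 1 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def correlated_pair(rng, r):
    """Draw two standard-normal scores whose correlation is r."""
    x = rng.gauss(0, 1)
    y = r * x + (1 - r * r) ** 0.5 * rng.gauss(0, 1)
    return x, y


def simulate(n_sims=5000, n=20, r=0.5, alpha=0.05, seed=1):
    """Return (strict_rate, flexible_rate) of false positives under a true null."""
    rng = random.Random(seed)
    strict = flexible = 0
    for _ in range(n_sims):
        # Both groups come from the same population: any "effect" is a false positive.
        g1 = [correlated_pair(rng, r) for _ in range(n)]
        g2 = [correlated_pair(rng, r) for _ in range(n)]
        p1 = two_sided_p([s[0] for s in g1], [s[0] for s in g2])
        p2 = two_sided_p([s[1] for s in g1], [s[1] for s in g2])
        strict += p1 < alpha              # one measure, chosen in advance
        flexible += min(p1, p2) < alpha   # report whichever measure "worked"
    return strict / n_sims, flexible / n_sims


if __name__ == "__main__":
    strict_rate, flexible_rate = simulate()
    print(f"single pre-chosen measure: {strict_rate:.3f}")
    print(f"either of two measures:    {flexible_rate:.3f}")
```

Under these assumptions the first rate should sit near the nominal .05 and the second should be higher, because the flexible rule counts every strict false positive plus those that appear only on the second measure.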
... One event was the publication of Bem [9] who across nine experiments reported evidence of precognition, a phenomenon which proposes that people's conscious awareness of future events can influence current ones. Surprised by how these findings could be published, many researchers voiced concerns about the inherent flexibility involved in the process of designing and analysing scientific studies, with such 'researcher degrees of freedom' likely leading to a prevalence of false positives in the published literature [10,11]. Independent teams of researchers subsequently failed to replicate Bem's findings [12,13]. ...
... The discipline of meta-science (the scientific study of science itself) has shed light on many intertwining contributors to low replicability, reproducibility and transparency in research [21]. For example, researchers have outlined numerous questionable research practices (QRPs), such as hypothesizing after the results are known (HARKing) [22], and p-hacking techniques that exponentially increase the likelihood of detecting false positives [10,23,24,25]. Furthermore, academic incentive structures have received greater critical revaluation for their focus on quantity over quality, arguably contributing to weak specification of theories and analysis plans, inadequate statistical power, poor measurement, a lack of replication and reproducibility checks, and non-transparent reporting (see [26,27]). ...
Article
Concerns about the replicability, reproducibility and transparency of research have ushered in a set of practices and behaviours under the umbrella of ‘open research’. To this end, many new initiatives have been developed that represent procedural (i.e. behaviours and sets of commonly used practices in the research process), structural (new norms, rules, infrastructure and incentives), and community-based change (working groups, networks). The objectives of this research were to identify and outline international initiatives that enhance awareness and uptake of open research practices in the discipline of psychology. A systematic mapping review was conducted in three stages: (i) a Web search to identify open research initiatives in psychology; (ii) a literature search to identify related articles; and (iii) a hand search of grey literature. Eligible initiatives were then coded into an overarching theme of procedural, structural or community-based change. A total of 187 initiatives were identified; 30 were procedural (e.g. toolkits, resources, software), 70 structural (e.g. policies, strategies, frameworks) and 87 community-based (e.g. working groups, networks). This review highlights that open research is progressing at pace through various initiatives that share a common goal to reform research culture. We hope that this review promotes their further adoption and facilitates coordinated efforts between individuals, organizations, institutions, publishers and funders.
... Nevertheless, simulation studies have shown that these violations could be counterbalanced by the other fulfilled criteria in the same data set (Ruscio et al., 2011). Additionally, CCFI reduces the chance of and prevents confirmation bias when interpreting the TA results (Ruscio and Kaczetow, 2009; Simmons et al., 2011). Furthermore, the nature of the latent constructs that were investigated may vary with sample type regarding individual characteristics, age, gender, and even cultural differences (Fiske, 2002), which were not examined here. ...
Article
Taxometric analysis (TA) is a technique designed to elucidate the structure of a psychological construct, specifically determining whether the latent variable is categorical (taxon) or dimensional. The taxon hypothesis is significant because the structure of a latent construct influences how we conceptualize, characterize, and measure it, thereby impacting the methodologies employed in both research and practical applications. In this study, data from two separate studies were subjected to TA. Study 1 involves secondary school students (N = 2024) and explores factors such as Achievement Goals and Self-Efficacy within the context of language acquisition. Study 2 examines issues among service teachers (N = 494) and includes variables such as Attitudes, Self-Efficacy, Commitment, and Cognitive and Affective conditions within the framework of STEM education. Given that the taxon hypothesis is tested for the first time using these types of psychoeducational data, Taxometrics is applied in an exploratory manner to provide a deeper understanding of the nature of these constructs. The results of TA are based on a series of indicators that identified cases of dimensional constructs when items from a single dimension were used as input. However, when all elements related to achievement goals and teacher readiness were utilized as input, the results revealed ambiguous latent structures. This emerging ambiguity prompts theoretical and epistemological discourse to explain the findings and advocate for a reevaluation of the nature of latent psychoeducational constructs.
... We recruited as many children as possible, ensuring that each group exceeded 20 participants (Simmons et al., 2011). A post hoc power analysis conducted with G*Power, version 3.1.9.6 (Faul et al., 2009) indicated that with 110 children, we had substantial power (80%) to detect medium-sized differences in improvement between the music and the other two groups (r = .26, ...
Article
Music training is widely claimed to enhance nonmusical abilities, yet causal evidence remains inconclusive. Moreover, research tends to focus on cognitive over socioemotional outcomes. In two studies, we investigated whether music training improves emotion recognition in voices and faces among school-aged children. We also examined music-training effects on musical abilities, motor skills (fine and gross), broader socioemotional functioning, and cognitive abilities including nonverbal reasoning, executive functions, and auditory memory (short-term and working memory). Study 1 (N = 110) was a 2-year longitudinal intervention conducted in a naturalistic school setting, comparing music training to basketball training (active control) and no training (passive control). Music training improved fine-motor skills and auditory memory relative to controls, but it had no effect on emotion recognition or other cognitive and socioemotional abilities. Both music and basketball training improved gross-motor skills. Study 2 (N = 192) compared children without music training to peers attending a music school. Although music training correlated with better emotion recognition in speech prosody (tone of voice), this association disappeared after controlling for socioeconomic status, musical abilities, or short-term memory. In contrast, musical abilities correlated with emotion recognition in both prosody and faces, independently of training or other confounding variables. These findings suggest that music training enhances fine-motor skills and auditory memory, but it does not causally improve emotion recognition, other cognitive abilities, or socioemotional functioning. Observed advantages in emotion recognition likely stem from preexisting musical abilities and other confounding factors such as socioeconomic status.
Article
Socially Assistive Robots are studied in different Child-Robot Interaction settings. However, logistical constraints limit accessibility, particularly affecting timely support for mental wellbeing. In this work, we have investigated whether online interactions with a robot can be used for the assessment of mental wellbeing in children. The children (N=40, 20 girls and 20 boys; 8-13 years) interacted with the Nao robot (30-45 mins) over three sessions, at least a week apart. Audio-visual recordings were collected throughout the sessions that concluded with the children answering user perception questionnaires pertaining to their anxiety towards the robot, and the robot's abilities. We divided the participants into three wellbeing clusters (low, med and high tertiles) using their responses to the Short Moods and Feelings Questionnaire (SMFQ) and further analysed how their wellbeing and their perceptions of the robot changed over the wellbeing tertiles, across sessions and across participants’ gender. Our primary findings suggest that (I) online mediated-interactions with robots can be effective in assessing children's mental wellbeing over time, and (II) children's overall perception of the robot either improved or remained consistent across time. Supplementary exploratory analyses have also revealed that the gender of the children affected their wellbeing assessments with interactions effectively distinguishing between varying levels of wellbeing for both boys and girls for the first session and only for boys during the second session. The analyses have also revealed that girls have a higher opinion of the robot as a confidante as compared with boys. Findings from this work affirm the potential of using online mediated interactions with robots for the assessment of the mental wellbeing of children.
Article
The recognition that researcher discretion coupled with unconscious biases and motivated reasoning sometimes leads to false findings (“p-hacking”) led to the broad embrace of study preregistration and other open-science practices in experimental research. Paradoxically, the preregistration of quasi-experimental studies remains uncommon although such studies involve far more discretionary decisions and are the most prevalent approach to making causal claims in the social sciences. I discuss several forms of recent empirical evidence indicating that questionable research practices contribute to the comparative unreliability of quasi-experimental research and advocate for adopting the preregistration of such studies. The implementation of this recommendation would benefit from further consideration of key design details (e.g., how to balance data cleaning with credible preregistration) and a shift in research norms to allow for appropriately nuanced sensemaking across prespecified, confirmatory results and other exploratory findings.
Article
Mapping biological mechanisms in cellular systems is a fundamental step in early-stage drug discovery that serves to generate hypotheses on what disease-relevant molecular targets may effectively be modulated by pharmacological interventions. With the advent of high-throughput methods for measuring single-cell gene expression under genetic perturbations, we now have effective means for generating evidence for causal gene-gene interactions at scale. However, evaluating the performance of network inference methods in real-world environments is challenging due to the lack of ground-truth knowledge. Moreover, traditional evaluations conducted on synthetic datasets do not reflect the performance in real-world systems. We thus introduce CausalBench, a benchmark suite revolutionizing network inference evaluation with real-world, large-scale single-cell perturbation data. CausalBench, distinct from existing benchmarks, offers biologically-motivated metrics and distribution-based interventional measures, providing a more realistic evaluation of network inference methods. An initial systematic evaluation of state-of-the-art causal inference methods using our CausalBench suite highlights how poor scalability of existing methods limits performance. Moreover, methods that use interventional information do not outperform those that only use observational data, contrary to what is observed on synthetic benchmarks. CausalBench subsequently enables the development of numerous promising methods through a community challenge, thus demonstrating its potential as a transformative tool in the field of computational biology, bridging the gap between theoretical innovation and practical application in drug discovery and disease understanding. Thus, CausalBench opens new avenues for method developers in causal network inference research, and provides to practitioners a principled and reliable way to track progress in network methods for real-world interventional data.
Article
The term enthusiasm is used frequently in both day-to-day language and professional settings. Scientifically, however, enthusiasm is not clearly defined. It is conceptualized and measured in different ways. In the present research, we examined the internal structure of enthusiasm. First, 28 features of enthusiasm were identified (Study 1.1) and rated on their centrality (Study 1.2). Results showed that features indicating joy and motivation were rated as central to the concept of enthusiasm, whereas features indicating restlessness and impatience were rated as less central. The validity of the central features was supported in three follow-up studies. More specifically, we found that the more central features were recalled better (Study 2.1), recognized faster (Study 2.2), and more often mentioned in autobiographical recalls of enthusiasm (Study 2.3). Taken together, the findings indicate that enthusiasm is prototypically structured, and that prototypical enthusiasm is a positive, energetic feeling that is associated with goal orientation and often involves interpersonal contact.
Article
The Methodology Corner has opted to call attention to questionable research practices in 2025. This first column of the year specifically looks at p-hacking. Studies subject to p-hacking may be harmful to the science of nursing education. This article provides recommendations for the nursing education community on how to help prevent p-hacking in the nursing literature. [J Nurs Educ. 2025;64(3):211–212.]
Article
Past experimental research shows that prosocial behavior promotes happiness. But do past findings hold up to current standards of consistent, rigorous, and generalizable evidence? In this review, we considered the evidentiary value of past experiments examining the happiness (i.e., subjective well-being; SWB) benefits of prosocial action, such as spending money on others or acts of kindness, in non-clinical samples. Specifically, we examined: (1) how consistent findings are across meta-analyses, (2) the conclusions of pre-registered, well-powered experiments, and (3) if the SWB benefits of prosociality are detectable beyond WEIRD (White-Western, Educated, Industrialized, Rich, Democratic) samples. Across the two meta-analyses we found, prosocial behavior led to a small consistent increase in happiness, yet estimates were based primarily on underpowered and WEIRD samples. We identified a growing number of pre-registered experiments (19/71 conducted to date), in which: (1) roughly half were well-powered; (2) only two recruited non-WEIRD samples, both underpowered and collectively showing mixed results; and (3) most examined prosocial spending (79%) over other prosocial behaviors, with happiness gains observed most consistently in well-powered studies on prosocial spending. Finally, we found that just 19% of all experiments recruited non-WEIRD samples, most of which were underpowered and presented mixed results, with acts of prosocial spending demonstrating the most consistent evidence of happiness benefits. We join other researchers in urging for more well-powered pre-registered experiments examining various prosocial behaviors, particularly with Global Majority samples, to ensure that our understanding of the SWB benefits of prosociality are firmly grounded in solid and inclusive evidence.
Article
People tend to approach agreeable propositions with a bias toward confirmation and disagreeable propositions with a bias toward disconfirmation. Because the appropriate strategy for solving the four-card Wason selection task is to seek disconfirmation, the authors predicted that people motivated to reject a task rule should be more likely to solve the task than those without such motivation. In two studies, participants who considered a Wason task rule that implied their own early death (Study 1) or the validity of a threatening stereotype (Study 2) vastly outperformed participants who considered nonthreatening or agreeable rules. Discussion focuses on how a skeptical mindset may help people avoid confirmation bias both in the context of the Wason task and in everyday reasoning.
Article
Some effects diminish when tests are repeated. Jonathan Schooler says being open about findings that don't make the scientific record could reveal why.
Article
Does psi exist? D. J. Bem (2011) conducted 9 studies with over 1,000 participants in an attempt to demonstrate that future events retroactively affect people's responses. Here we discuss several limitations of Bem's experiments on psi; in particular, we show that the data analysis was partly exploratory and that one-sided p values may overstate the statistical evidence against the null hypothesis. We reanalyze Bem's data with a default Bayesian t test and show that the evidence for psi is weak to nonexistent. We argue that in order to convince a skeptical audience of a controversial claim, one needs to conduct strictly confirmatory studies and analyze the results with statistical tests that are conservative rather than liberal. We conclude that Bem's p values do not indicate evidence in favor of precognition; instead, they indicate that experimental psychologists need to change the way they conduct their experiments and analyze their data.
Article
It is proposed that motivation may affect reasoning through reliance on a biased set of cognitive processes--that is, strategies for accessing, constructing, and evaluating beliefs. The motivation to be accurate enhances use of those beliefs and strategies that are considered most appropriate, whereas the motivation to arrive at particular conclusions enhances use of those that are considered most likely to yield the desired conclusion. There is considerable evidence that people are more likely to arrive at conclusions that they want to arrive at, but their ability to do so is constrained by their ability to construct seemingly reasonable justifications for these conclusions. These ideas can account for a wide variety of research concerned with motivated reasoning.
Article
Perhaps the most striking aspect of gambling behavior is that people continue to gamble despite persistent failure. One reason for this persistence may be that gamblers evaluate outcomes in a biased manner. Specifically, gamblers may tend to accept wins at face value but explain away or discount losses. Experiment 1 tested this hypothesis by recording subjects' explanations of the outcomes of their bets on professional football games. The results supported the hypothesis: Subjects spent more time explaining their losses than their wins. A content analysis of these explanations revealed that subjects tended to discount their losses but "bolster" their wins. Finally, subjects remembered their losses better during a recall test 3 weeks later. Experiments 2 and 3 extended this analysis by demonstrating that a manipulation of the salience or existence of a critical "fluke" play in a sporting event had a greater impact on the subsequent expectations of those who had bet on the losing team than of those who had bet on the winning team. Both the implications and the possible mechanisms underlying these biases are discussed.
Article
There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.
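The dependence on power and on the ratio of true to no relationships can be made concrete with the standard positive-predictive-value calculation. The sketch below covers only the simplest case, with no bias term and a single research team; the function name `ppv` and its parameter names are labels chosen here for illustration, where `prior_odds` stands for the ratio of true to no relationships among those tested.

```python
def ppv(prior_odds, power, alpha=0.05):
    """Probability that a statistically significant finding reflects a true relationship.

    prior_odds: ratio of true to no relationships among those tested (assumed known)
    power:      probability of detecting a true relationship (1 - beta)
    alpha:      probability of a significant result when no relationship exists
    """
    true_positives = power * prior_odds   # expected true hits per null relationship tested
    false_positives = alpha               # expected false hits per null relationship tested
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # A well-powered test of a plausible hypothesis versus an
    # underpowered test of a long-shot hypothesis.
    print(round(ppv(prior_odds=1.0, power=0.8), 3))
    print(round(ppv(prior_odds=0.1, power=0.2), 3))
```

With low prior odds and low power the value drops below one half, which is the sense in which a claimed finding can be more likely false than true.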
Article
Abstract: Do causal attributions serve the need to protect and/or enhance self‐esteem? In a recent review, Miller and Ross (1975) proposed that there is evidence for self‐serving effect in the attribution of success but not in the attribution of failure; and that this effect reflects biases in information‐processing rather than self‐esteem maintenance. The present review indicated that self‐serving effects for both success and failure are obtained in most but not all experimental paradigms. Processes which may suppress or even reverse the self‐serving effect were discussed. Most important, the examination of research in which self‐serving effects are obtained suggested that these attributions are better understood in motivational than in information‐processing terms.
Article
Cases of clear scientific misconduct have received significant media attention recently, but less flagrantly questionable research practices may be more prevalent and, ultimately, more damaging to the academic enterprise. Using an anonymous elicitation format supplemented by incentives for honest reporting, we surveyed over 2,000 psychologists about their involvement in questionable research practices. The impact of truth-telling incentives on self-admissions of questionable research practices was positive, and this impact was greater for practices that respondents judged to be less defensible. Combining three different estimation methods, we found that the percentage of respondents who have engaged in questionable practices was surprisingly high. This finding suggests that some questionable practices may constitute the prevailing research norm.
Article
Summary: In clinical trials with sequential patient entry, fixed sample size designs are unjustified on ethical grounds and sequential designs are often impracticable. One solution is a group sequential design dividing patient entry into a number of equal-sized groups so that the decision to stop the trial or continue is based on repeated significance tests of the accumulated data after each group is evaluated. Exact results are obtained for a trial with two treatments and a normal response with known variance. The design problem of determining the required size and number of groups is also considered. Simulation shows that these normal results may be adapted to other types of response data. An example shows that group sequential designs can sometimes be statistically superior to standard sequential designs.
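To see why repeated significance tests on accumulating data need special treatment, the following sketch simulates the unadjusted case: two treatments with identical normal responses of known variance, tested at the ordinary two-sided .05 level after every group. It is not the group sequential design described above, which chooses a stricter per-look threshold; the function name `type_one_error`, the group size, and the number of simulations are assumptions made for illustration.

```python
import random
from statistics import NormalDist


def type_one_error(n_groups, group_size=10, alpha=0.05, n_sims=4000, seed=7):
    """Share of null trials that stop with a 'significant' result at any look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # unadjusted two-sided threshold
    rejections = 0
    for _ in range(n_sims):
        total_diff = 0.0   # running sum of (treatment A - treatment B) responses
        n_per_arm = 0
        for _ in range(n_groups):
            for _ in range(group_size):
                # Both arms share the same mean, so the null hypothesis is true.
                total_diff += rng.gauss(0, 1) - rng.gauss(0, 1)
            n_per_arm += group_size
            # Each paired difference has variance 2, so the sum has variance 2 * n_per_arm.
            z = total_diff / (2 * n_per_arm) ** 0.5
            if abs(z) > z_crit:
                rejections += 1
                break  # stop the trial at the first significant look
    return rejections / n_sims


if __name__ == "__main__":
    for looks in (1, 2, 5, 10):
        print(looks, round(type_one_error(looks), 3))
```

With a single look the rate should be close to the nominal level; with more looks it rises, which is the inflation a group sequential design is built to control.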
Article
When the Dartmouth football team played Princeton in 1951, much controversy was generated over what actually took place during the game. Basically, there was disagreement between the two schools as to what had happened during the game. A questionnaire designed to get reactions to the game and to learn something of the climate of opinion was administered at each school and the same motion picture of the game was shown to a sample of undergraduates at each school, followed by another questionnaire. Results indicate that the "game" was actually many different games and that each version of the events that transpired was just as "real" to a particular person as other versions were to other people.