Article

Abstract

Replication studies are associated with different goals in the empirical sciences, depending on whether research aims at developing new theories or at testing existing theories (context of discovery vs. context of justification, cf. Reichenbach, 1938). Conceptual replications strive for generalization and can be useful in the context of discovery. Direct replications, by contrast, target the replicability of a specific empirical research result under independent conditions and are thus indispensable in the context of justification. Without assuming replicability, it is impossible to reach a consensus about generally accepted empirical facts. However, such accepted facts are mandatory for testing theories in the empirical sciences. On the basis of this framework, we suggest and motivate standards for replication studies. A characteristic feature of psychological science is the probabilistic nature of the to-be-replicated empirical claim, which typically takes the form of a statistical hypothesis. This raises a number of methodological problems concerning the nature of the replicability hypothesis, the control of error probabilities in statistical decisions about the replicability hypothesis, the determination of the to-be-detected effect size given distortions of published effect sizes by publication bias, the a priori determination of sample sizes for replication studies, and the correct interpretation of the replication rate (i.e., the success rate in a series of replication studies). We propose and discuss solutions for all these problems.
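One of the problems listed above is the a priori determination of sample sizes for replication studies. As a minimal illustration of that step (not taken from the paper; the chosen effect size, α, and power are assumptions), the following Python sketch computes the per-group sample size a direct replication would need for a two-sample t test, using the noncentral t distribution:

```python
# Illustrative sketch: a priori sample size for a direct replication
# (two-sided two-sample t test). Effect size, alpha, and power are assumptions.
from scipy import stats

def power_two_sample_t(d, n_per_group, alpha=0.05):
    """Power of a two-sided two-sample t test for standardized effect size d."""
    df = 2 * n_per_group - 2
    nc = d * (n_per_group / 2) ** 0.5            # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # P(|T| > t_crit) under the noncentral t distribution
    return stats.nct.sf(t_crit, df, nc) + stats.nct.cdf(-t_crit, df, nc)

def required_n_per_group(d, target_power=0.95, alpha=0.05):
    """Smallest per-group n that reaches the target power for effect size d."""
    n = 2
    while power_two_sample_t(d, n, alpha) < target_power:
        n += 1
    return n

if __name__ == "__main__":
    # Example: planning to detect an assumed (bias-corrected) effect of d = 0.4
    # with alpha = .05 and power = .95.
    print(required_n_per_group(0.4, target_power=0.95, alpha=0.05))
```

If the published effect size is inflated by publication bias, the replication should be planned for a corrected (smaller) d, which increases the required n; corrections of this kind are the topic of several of the works listed further below.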


... In this paper, we perform both close (Studies 1 & 2) and conceptual (Studies 3 & 4) replications (Schmidt, 2009). Following the theory-testing design of Meindl and Ehrlich (1987), Studies 1 and 2 are close replications (Erdfelder & Ulrich, 2018; Hüffmeier, Mazei, & Schultze, 2016). In addition to testing the extent to which the original study results hold, close replications estimate the effect size expected in further replications (Brandt et al., 2014). ...
... Although replication is necessary for advancing leadership science, we found few examples in the extant leadership literature to help guide our ROL replication effort. Thus, we drew heavily from commentaries published in the fields of general psychology and experimental social psychology (Brandt et al., 2014; Crandall & Sherman, 2016; Erdfelder & Ulrich, 2018; Hüffmeier et al., 2016; Schmidt, 2009) and even sourced articles published in languages other than English. We encourage future leadership replications to follow the steps we laid out in our replication approach section and to explore both types of replications: close and conceptual (Brandt et al., 2014). ...
Article
Given changes in business and society, the romance of leadership theory, which describes a glorification of the perceived influence of leaders on organizational outcomes, is arguably more relevant than at its conception over thirty years ago. This paper presents four studies aimed at replicating Meindl and Ehrlich's (1987) early experiment on the romance of leadership, specifically considering the effect of leadership attributions on company evaluations. Studies 1 and 2 are close replications, whereas Studies 3 and 4 provide a conceptual replication drawing from a broader sample in age and work experience and including additional experimental conditions. These conditions vary the gender of the leader, include both success and failure situations, and add outcome variables capturing participants' behavioral intentions to support, invest in, seek employment with, or purchase from the company. Taken together, these studies do not support Meindl and Ehrlich's findings that organizations are viewed more favorably when such outcomes are attributed to leadership. We discuss implications for the romance of leadership theory.
... Ceteris paribus, as these errors increase, the replication-probability of a true hypothesis decreases, thus lowering the chance that a replication attempt obtains a similar data-pattern as the original study. Since NHST remains the statistical inference strategy in empirical psychology, many today (rightly) view the field as undergoing a replicability-crisis (Erdfelder and Ulrich, 2018). ...
... The six RPS steps thus obtain a parameter we can trust to the extent that we accept the error probabilities. Indeed, unless strong reasons motivate doubt that our data are faithful, the certainty we invest in this parameter ought to mirror (1-β), i.e., the replication probability of data closely matching a true hypothesis (Miller and Ulrich, 2016; Erdfelder and Ulrich, 2018). ...
Article
Full-text available
In psychology as elsewhere, the main statistical inference strategy to establish empirical effects is null-hypothesis significance testing (NHST). The recent failure to replicate allegedly well-established NHST-results, however, implies that such results lack sufficient statistical power, and thus feature unacceptably high error-rates. Using data-simulation to estimate the error-rates of NHST-results, we advocate the research program strategy (RPS) as a superior methodology. RPS integrates Frequentist with Bayesian inference elements, and leads from a preliminary discovery against a (random) H0-hypothesis to a statistical H1-verification. Not only do RPS-results feature significantly lower error-rates than NHST-results, RPS also addresses key-deficits of a “pure” Frequentist and a standard Bayesian approach. In particular, RPS aggregates underpowered results safely. RPS therefore provides a tool to regain the trust the discipline had lost during the ongoing replicability-crisis.
... In the second retention test 6 weeks after the acquisition phase, however, the frontal knee angle of both legs lies below the baseline level from the pretest. Whether this constitutes a finding that warrants theoretical discussion should first be examined in conceptual and direct replication studies [38]. In contrast to the results of the present studies, these findings provide first insights into the medium- and long-term effectiveness of a single video feedback training with respect to a potential improvement of the frontal knee angle during single-leg landings. ...
Article
Abstract Background In sports, a dynamic knee valgus during single-leg landings is assumed to be a relevant injury mechanism of the anterior cruciate ligament. Whereas existing effective prevention programs for reducing knee valgus primarily aim at improving general conditioning and/or coordination, video feedback training focuses on correcting individual deficits in movement technique in order to, among other things, reduce a potential injury risk. Objective To evaluate the short-term and, in particular, the medium- and long-term effectiveness of a video feedback training for changing the frontal knee angle during single-leg landings. Method In an exploratory study, 10 physically active participants (age: 25 ± 5 years, height: 170.8 ± 4.5 cm) were tested. Based on the Landing Error Scoring System (LESS test), they performed single-leg drop jumps in a pretest, in an acquisition phase with video feedback, and in two retention tests 2 and 6 weeks after the acquisition phase without video feedback. During the acquisition phase, video feedback was given after every second jump, and additionally on the participant's request, via an expert model with neutral knee alignment shown in overlay mode from the frontal perspective. Results Results were analyzed separately for the participants' jumping and non-jumping leg. They show a meaningful reduction of the frontal knee angle for the jumping leg (F(1, 9) = 10.43, p = 0.01, η²p = 0.54, 95% CI [0.04; 0.74]) during single-leg landings in the acquisition phase, but no statistically meaningful reduction for the non-jumping leg (F(1, 9) = 4.07, p = 0.08, η²p = 0.31, 1-β = 0.44). In the retention test after 6 weeks, the frontal knee angle of both legs approaches the baseline level from the pretest again. Conclusion Video feedback training offers an easy-to-implement, alternative injury prevention program. The absence of medium- and long-term change and the high variability of the frontal knee angle suggest that video feedback training should be carried out repeatedly and/or regularly. Further studies with control group designs and different feedback procedures will have to examine systematically whether a longer-term reduction of a potential injury risk to the anterior cruciate ligament can be achieved.
... They are not new to educational research either; for example, the past decade has been called the "decade of replication research" (Perry et al., 2022). Serving different purposes, replication studies fall into three categories: direct, approximate, and conceptual, with decreasing closeness to the original study toward the latter (Erdfelder & Ulrich, 2018; Perry et al., 2022). The study presented here can be classified as approximate/conceptual, as it combines characteristics of both types. ...
Conference Paper
Full-text available
Educational choices reproduce social inequalities. The proportion of a birth cohort obtaining the "Abitur" (A-Level) is constantly rising. At the same time, a larger proportion of them opts for vocational training, and some also subsequently for university studies (additive double qualification). Although there is already research investigating the reasons for completing a double qualification, it has limitations. To counter geographical constraints in particular, the replication study presented here uses data from the National Educational Panel Study (NEPS) to investigate the transition into university education after completing VET with similar constructs (N = 310). Significant differences are found regarding age, gender, migration background, and cultural proximity. Supplementary analyses, using logistic regression models and FAMD, indicate the influence of cultural affection, even when controlling for socio-demographic factors.
... When speaking of replication, it is helpful to distinguish between two contexts or aims of studies (Erdfelder & Ulrich, 2018; Fiedler, 2017): discovery and justification (or verification). Studies in the discovery context can, for example, aim to address generalizations of theories or predictions by deviating from the original study in at least one relevant aspect (conceptual replications). ...
Article
Full-text available
Terror management theory (TMT) posits that mortality salience (MS) leads to more negative perceptions of persons who oppose one's worldview and to more positive perceptions of persons who confirm one's worldview. Recent failed replications of classic findings have called the empirical validity of this established idea into question. We believe that there are crucial methodological and theoretical aspects that have been neglected in these studies and that limit their explanatory power; thus, the studies of this registered report aimed to address these issues and to directly test the worldview defense hypothesis. First, we conducted two preregistered lab studies applying the classic worldview defense paradigm. The stimulus material (worldview-confirming and -opposing essays) was previously validated for students at a German university. In both studies, the MS manipulation (between-subjects) was followed by a distraction phase. Then, in Study 1 (N = 131), each participant read both essays (within-subjects). In Study 2 (N = 276), the essays were manipulated between-subjects. Credibility attribution towards the author was assessed as the dependent variable. In both studies, the expected interaction effects were not significant. In a third highly powered (registered) study (N = 1356), we used a previously validated worldview-opposing essay. The five classic worldview defense items served as the main dependent measure. The MS effect was not significant. Bayesian analyses favored the null hypothesis. An internal meta-analysis revealed a very small (Hedges' g = .09) but nonsignificant (p = .058) effect of MS. Altogether, the presented studies reveal challenges in providing strong evidence for this established idea.
... A "rigorous methodology for replication studies, grounded in the logic of research, [has] so far been neither applied nor developed" (Fiedler, 2018, p. 45; cf. Asendorpf et al., 2013; Erdfelder & Ulrich, 2018). Brandt et al. (2014, p. 218) formulated 34 criteria that a solid replication study should fulfill and bundled them in their "replication recipe" into five "ingredients": 1. "Carefully defining the effects and methods that the researcher intends to replicate" (cf. p. 218). ...
Article
To date, replications of published research results are extremely rare exceptions in (educational) psychology. The following article emphasizes the great scientific benefit and indispensability of replication studies. It pursues the question of why, despite this tremendous added value, almost no replication studies are published and why many research findings could not be replicated. The reasons for this are manifold: – the widespread (but absurd) opinion that "statistical significance" indicates the probability of replicating a research finding; – the confusion of "statistical significance" with relevance; – the bad habit of formulating a tested hypothesis only retrospectively (ex post), i.e., with knowledge of the study's findings, while passing it off as the theoretically derived starting point of the research (i.e., as formulated a priori); – the inflation of the α error due to multiple significance testing; – exclusively reporting results that support the research hypotheses while suppressing deviating findings; – insufficient construct validity of the measures; – fraud and deceit in science; – the traditional contempt for replications among editors, reviewers and third-party funders. All these reasons lead to the fact that almost exclusively "statistically significant" and "new" results are produced and published and that, therefore, false theories persist. Some essential countermeasures are outlined: – generous funding of replication studies and their publication; – reviewers' emphatic endorsement of methodologically adequate replication studies; – the willingness of journals to provide sufficient space for replication studies; – the appreciation of the great scientific value of replication studies, also in appointment procedures. Consequently, different addressees have to be approached with these countermeasures in order to establish and promote replication studies. However, sustainable changes can only be achieved if all protagonists (researchers, reviewers, journal editors, appointment committees, third-party funders) acknowledge their individual responsibility and suit the action to the word.
... To support this argument empirically, one would have to replicate the original embarrassment manipulation and demonstrate with a manipulation check that it did not result in significant differences in embarrassment in the replication study. As Erdfelder and Ulrich (2018) argue correctly: "An exact replication can be questioned only if ... it can be demonstrated that the experimental material at the time and place of the replication study fails to have the same psychological effect as at the time of the original study" (p. 6/7; my translation). ...
... As we mentioned above, the value of replications has been overlooked in the past years: journals were more into publishing fancy or sexy findings than valuing replications of established findings. Yet, a truly independent and direct (not conceptual) replication ensures that a particular effect is reproducible (Erdfelder & Ulrich, 2018), thereby adding to the importance of the effect. Therefore, a good journal has to publish methodologically sound replication studies independently of the results. ...
Chapter
Intervention research in schools is concerned with the systematic evaluation of supportive measures. It offers the chance to achieve substantial improvements in instructional provision. One prerequisite for this is a high methodological standard that ensures the reliability of findings. To optimize the fit between learners' prerequisites and programs, differential analyses are required. Finally, intervention research must address the conditions of transfer and implementation of innovative measures into school practice.
Article
Objective Body dissatisfaction is highly prevalent in overweight and obesity, while evidence for the efficacy of body image interventions is still scarce. This interventional pilot study investigates the efficacy and mechanisms of change of two stand-alone body image interventions in women with overweight and obesity. Methods Women with overweight and obesity (n = 76) were randomly assigned to five weekly sessions of either mirror exposure (ME) or a cognitive restructuring intervention (CR), or to a wait-list control group (WCG). Primary outcome measures were self-reported and interview-based body dissatisfaction; depression, self-esteem and emotional eating served as secondary outcome measures. Experimental paradigms were used prior to and after the interventions to analyze possible mechanisms of change: (a) Implicit Association Tests to assess weight-related attitudes, (b) eye-tracking experiments to assess visual processing of body pictures, and (c) a thought-sampling procedure to assess body-related cognitions and arousal. Results According to intent-to-treat analyses using linear mixed models, both interventions led to significant improvements in body image, while there were no changes in the WCG. Different mechanisms of change were identified. Conclusions Both types of interventions might be effective in reducing self-reported body dissatisfaction and interview-based shape concerns in overweight and obesity. However, as different mechanisms drive the effect, future research should clarify which individual might best benefit from which intervention.
Article
Full-text available
Educational decisions reproduce social inequalities. The proportion of a birth cohort obtaining the Abitur is rising continuously. At the same time, a larger share of these Abitur holders opts for vocational training, and some of them subsequently also for university studies (additive double qualification). The study presented here asks how the group of double qualifiers can be characterized. Although research findings on this question already exist, previous studies have various limitations, including restrictions to particular training occupations and to specific geographical regions. This also applies to the original study by Pilz/Ebner/Edeling (2020), which addressed the same research question but considered only selected occupations. To counter geographical narrowing in particular, the present replication study uses data from the National Educational Panel Study (N = 310) and employs similar constructs. The results show significant effects with respect to age, gender, migration background, and cultural proximity, the latter deviating from the findings of Pilz/Ebner/Edeling (2020). As an extension of the replication, further analyses were conducted using logistic regression models; these point to the influence of cultural affection, even when controlling for sociodemographic factors.
Article
Full-text available
In this comment, we report a simulation study that assesses error rates and average sample sizes required to reach a statistical decision for two sequential procedures, the sequential probability ratio test (SPRT) originally proposed by Wald (1947) and the independent segments procedure (ISP) recently suggested by Miller and Ulrich (2020). Following Miller and Ulrich (2020), we use sequential one-tailed t tests as examples. In line with the optimal efficiency properties of the SPRT already proven by Wald and Wolfowitz (1948), the SPRT outperformed the ISP in terms of efficiency without compromising error probability control. The efficiency gain in terms of sample size reduction achieved with the SPRT t test relative to the ISP may be as high as 25%. We thus recommend the SPRT as a default sequential testing procedure especially for detecting small or medium hypothesized effect sizes under H1 whenever a priori knowledge of the maximum sample size is not crucial. If a priori control of the maximum sample size is mandatory, however, the ISP is a very useful addition to the sequential testing literature. (PsycInfo Database Record (c) 2021 APA, all rights reserved).
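To give a concrete feel for the sequential logic compared in this comment, the following Python sketch implements Wald's SPRT in its simplest form (my own illustration under strong simplifying assumptions: normally distributed observations with known variance and two simple hypotheses about the mean; it is not the sequential t test studied in the comment):

```python
# Minimal SPRT sketch (simplified: known variance, simple H0 vs. H1).
# Not the sequential t test from the comment; parameters are illustrative.
import math
import random

def sprt(sample_stream, mu0=0.0, mu1=0.5, sigma=1.0, alpha=0.05, beta=0.05):
    """Run Wald's SPRT; returns ('H0' or 'H1', number of observations used)."""
    a = math.log(beta / (1 - alpha))       # lower boundary: accept H0
    b = math.log((1 - beta) / alpha)       # upper boundary: accept H1
    llr, n = 0.0, 0
    for x in sample_stream:
        n += 1
        # log likelihood ratio of one normal observation: log f1(x) - log f0(x)
        llr += ((x - mu0) ** 2 - (x - mu1) ** 2) / (2 * sigma ** 2)
        if llr >= b:
            return "H1", n
        if llr <= a:
            return "H0", n
    return "undecided", n

if __name__ == "__main__":
    random.seed(1)
    data = (random.gauss(0.5, 1.0) for _ in range(10_000))  # data generated under H1
    print(sprt(data))
```

Efficiency comparisons like the one reported above are obtained by running such a procedure many times under H0 and under H1 and recording the decisions and the sample sizes at termination.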
Article
Full-text available
Training of Word Recognition with Willy Wordbear: A Syllable-Based Reading Promotion Program for Elementary School. Being able to read words fluently and accurately is an important milestone in learning to read, but not all children reach it. For weak readers, it is often difficult to make the transition from letter-by-letter reading to visual word recognition through orthographic comparison processes using larger (sub-)lexical units. The syllable seems to provide a bridge to orthographic decoding for children who learn to read German. Against this background, this replication study investigated the effectiveness of a syllable-based reading training on the visual word recognition and reading comprehension of second graders in an experimental pre-post design. To this end, 101 children whose word recognition performance in a standardized reading test was below the mean value in comparison to the classroom norm were randomly assigned to the experimental group or a waiting control group. Linear models revealed significant improvements in orthographic decoding in the experimental group after completion of the 24-session small group training. Children who received the training of repeated reading and segmentation of frequent syllables were able to recognize words faster and more accurately. These findings are further evidence of the effectiveness of the training for promoting the recognition of written words.
Article
Evaluation of communication abilities after traumatic brain injury with the La Trobe Communication Questionnaire (LCQ): First results of the German replication study with neurologically healthy individuals. The La Trobe Communication Questionnaire (LCQ) was developed to reveal post-injury changes in communicative abilities after traumatic brain injury. The German version of the LCQ is the only available instrument that has been psychometrically examined and also enables a multi-perspective approach, including the self-perceptions of individuals as well as the perceptions of close others. The aim of this study was to replicate the findings of the original version concerning self-perception and the perception of close others. In this paper we report the main results for n = 160 neurologically healthy individuals (= 80 dyads) of the replication study. By replicating the main results, we show that the LCQ is a valid tool for the evaluation of perceived communication.
Article
Abstract. Theoretical background: According to theories of eating disorders, the maintenance of body checking (BC) behavior is associated with two diverging cognitive-affective processes, i.e., an increase in arousal versus a decrease in negative emotional valence. Research question: The aim is to replicate an online study examining whether BC is accompanied by both postulated processes and what role the subjective attractiveness of the checked body parts plays. Method: 125 women with high versus low shape and weight concerns rated their level of arousal and negative emotional valence in remembered BC episodes involving their subjectively most unattractive and most attractive body parts. Results: Only in women with high shape and weight concerns, and only in BC episodes involving the subjectively most unattractive body parts, did both an increase in arousal and a decrease in negative emotional valence occur. Conclusions: Both postulated processes were demonstrated.
Article
Full-text available
Abstract Successful interventions are both effective and implementable. To achieve this, intervention development and evaluation ideally proceed through several consecutive phases. The associated standards, however, often cannot be fully met when studies are conducted under real-world conditions. The present study is the fourth of a research program in which an intervention to foster the presentation competence (PC) of elementary school children was systematically developed and evaluated. It examines (1) whether the training can be implemented in practice as intended, i.e., with high fidelity, and (2) whether the previously observed effects on PC and speech anxiety can be replicated. Ten course instructors and 65 children took part in the cluster-randomized study with a wait-list control group and pre-post measurement. Implementation fidelity was assessed via the instructors' self-reports, PC via video ratings, and speech anxiety via questionnaires. The training was again delivered with high fidelity. For PC, significant effects were found on two of the 18 assessed presentation skills: body tension and personal address. No significant effects were found on speech anxiety.
Chapter
This chapter provides an introduction to social psychology by explaining what social psychology is, why it is not the same as lay psychology, and how it can be distinguished from neighboring disciplines. Subsequently, the methodological foundations of social psychology are outlined: the correlational method and the experiment are explained, and the statistical validation of empirical results is described. Criticism of experimental methods and the replicability of social-psychological research findings are also addressed. The chapter closes with an overview of the present volume (Das Individuum im sozialen Kontext) and of Sozialpsychologie II (Der Mensch in sozialen Beziehungen).
Article
Full-text available
Simonsohn, Nelson, and Simmons (2014a) proposed p-curve – the distribution of statistically significant p-values for a set of studies – as a tool to assess the evidential value of these studies. They argued that, whereas right-skewed p-curves indicate true underlying effects, left-skewed p-curves indicate selective reporting of significant results when there is no true effect (“p-hacking”). We first review previous research showing that, in contrast to the first claim, null effects may produce right-skewed p-curves under some conditions. We then question the second claim by showing that not only selective reporting but also selective nonreporting of significant results due to a significant outcome of a more popular alternative test of the same hypothesis may produce left-skewed p-curves, even if all studies reflect true effects. Hence, just as right-skewed p-curves do not necessarily imply evidential value, left-skewed p-curves do not necessarily imply p-hacking and absence of true effects in the studies involved.
Article
Introduction: Irrespective of the type of psychotherapy used, the abstinence-oriented treatment of drug abusers is less successful than that for alcohol abusers. If, on the other hand, the two groups are parallelized in such a way that the patients are identical with respect to the five characteristics of gender, age, schooling, work situation and partner situation, then there is no difference between the success rates of the drug and alcohol abusers. The aim of this study is to determine whether this result can be replicated in another therapeutic institution. Method: Retrospective field study of 320 abusers of illegal drugs and 320 alcohol abusers who were treated with behaviour therapy. By combining the binary characteristics gender, work situation and age, the drug-dependent patients were divided into 2³ = 8 groups, and the same number of alcohol abusers was randomly selected for each group. The scheduled period of inpatient treatment was 90 days for the alcohol abusers and 120 days for the drug abusers. Every week the patients had one session of individual psychotherapy and four to five group therapy sessions. According to the indications, the certified behaviour therapists implemented the following interventions: behaviour analysis, relapse prevention, cognitive therapy, self-management and behavioural family therapy. Comparison of the success rates was carried out using the chi-square test, and changes in the psychological findings were tested with one-way analysis of variance. Results: There was no difference between drug and alcohol abusers with respect to the rate of therapy termination according to plan (around 80%). A total of 48% of the drug abusers and 41% of the alcohol abusers who could be followed up had been continuously abstinent at the one-year catamnesis without a single relapse. There were also no differences between the two groups when it was assumed that the patients who could not be followed up had relapsed. In the case of both the drug and alcohol abusers, the abstinence rate was highest in employed men over 29 years of age (57.6%; 48.4%). The abstinence rate was lowest in employed female drug abusers (27.8%) and young, unemployed female drug abusers (0%, n = 11). Discussion: What appears to influence the abstinence rate after inpatient treatment is not only the type of substance consumed but also sociodemographic characteristics. In addition to individually tailored therapy, our results confirm the importance of a highly differentiated presentation of therapy outcomes in the specialist literature. An average rate of abstinence (e.g. 30%) is insufficient to evaluate an intervention unless information is also provided about the patients for whom the intervention is suitable and those for whom it is not. In accordance with the Reproducibility Project, we consider replication studies essential in psychotherapy, even though in practice the considerable methodological requirements can only be partially fulfilled.
Article
Full-text available
Several hundred research groups attempted replications of published effects in so-called Many Labs studies involving thousands of research participants. Given this enormous investment, it seems timely to assess what has been learned and what can be learned from this type of project. My evaluation addresses four questions: First, do these replication studies inform us about the replicability of social psychological research? Second, can replications detect fraud? Third, does the failure to replicate a finding indicate that the original result was wrong? Finally, do these replications help to support or disprove any social psychological theories? Although evidence of replication failures resulted in important methodological changes, the 2015 Open Science Collaboration findings sufficed to make the point. To assess the state of social psychology, we have to evaluate theories rather than randomly selected research findings.
Article
Abstract. Research question: Do the characteristics "number of previous detoxifications", "depressiveness", and "self-efficacy expectation (SWE)" allow a prognosis of abstinence at the one-year follow-up? Method: Prospective replication field study in a different clinic, in which binary logistic regression and chi-square tests were used to analyze differences in patient characteristics between alcohol-dependent patients who were continuously abstinent at the one-year follow-up (N = 285) and those who had relapsed (N = 274). Results: As in our previous study, age, gender, schooling, unemployment, marital status, partner situation, and addiction-specific and psychological comorbidity were not prognostically relevant, with the exception of personality disorders. Again, patients with fewer than two detoxifications and high self-efficacy had the highest probability of living continuously abstinent for one year (82%). It was also confirmed that improvements in psychological distress do not correlate with abstinence. Depressiveness and previous rehabilitation treatments were not replicated as predictors. Conclusions: Previous detoxifications, self-efficacy, and personality disorders could be generally valid predictors for the type of clinic examined here. Whether the exclusion of the two predictors depressiveness and previous rehabilitation treatments is a consequence of newly introduced interventions for depressed and treatment-experienced patients remains to be examined. As in our previous study, the regression model, despite low explained variance and medium effect sizes, allows causal hypotheses to be derived for clinic-specific improvements of treatment. Replication studies, like an empirical orientation in general, should be an integral part of behavior-therapeutic treatment.
Article
Full-text available
Publication bias hampers the estimation of true effect sizes. Specifically, effect sizes are systematically overestimated when studies report only significant results. In this paper we show how this overestimation depends on the true effect size and on the sample size. Furthermore, we review and follow up methods originally suggested by Hedges (1984), Iyengar and Greenhouse (1988), and Rust, Lehmann, and Farley (1990) allowing the estimation of the true effect size from published test statistics (e.g., from the t-values of reported significant results). Moreover, we adapted these methods allowing meta-analysts to estimate the percentage of researchers who consign undesired results in a research domain to the file drawer. We also apply the same logic to the case when significant results tend to be underreported. We demonstrate the application of these procedures for conventional one-sample and two-sample t-tests. Finally, we provide R and MATLAB versions of a computer program to estimate the true unbiased effect size and the prevalence of publication bias in the literature.
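The estimation logic described in this abstract can be sketched in a few lines (my own Python illustration of a Hedges-type truncated-likelihood estimator; it is not the authors' R/MATLAB program, and the example t values are invented): if only significant results are published, each reported t value follows a noncentral t distribution truncated at the critical value, and the underlying effect size can be recovered by maximizing the truncated likelihood.

```python
# Illustrative sketch: estimate the true effect size from published
# *significant* one-sample t values (truncated noncentral t likelihood).
# Not the authors' program; t values and n are hypothetical.
import numpy as np
from scipy import stats, optimize

def neg_log_lik(delta, t_values, n, alpha=0.05):
    df = n - 1
    nc = delta * np.sqrt(n)                  # noncentrality for a one-sample t test
    t_crit = stats.t.ppf(1 - alpha, df)      # one-tailed significance threshold
    # density of a significant t value = nct density / P(T > t_crit)
    log_dens = stats.nct.logpdf(t_values, df, nc)
    log_trunc = np.log(stats.nct.sf(t_crit, df, nc))
    return -(log_dens - log_trunc).sum()

def estimate_effect(t_values, n, alpha=0.05):
    res = optimize.minimize_scalar(
        neg_log_lik, bounds=(0.0, 3.0),
        args=(np.asarray(t_values), n, alpha), method="bounded")
    return res.x

if __name__ == "__main__":
    # Hypothetical published (significant) t values from studies with n = 30 each;
    # the bias-corrected estimate is smaller than the naive mean of t / sqrt(n).
    print(estimate_effect([2.1, 2.4, 2.0, 2.8, 2.2], n=30))
```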
Article
Full-text available
p-curves provide a useful window for peeking into the file drawer in a way that might reveal p-hacking (Simonsohn, Nelson, & Simmons, 2014a). The properties of p-curves are commonly investigated by computer simulations. On the basis of these simulations, it has been proposed that the skewness of this curve can be used as a diagnostic tool to decide whether the significant p values within a certain domain of research suggest the presence of p-hacking or actually demonstrate that there is a true effect. Here we introduce a rigorous mathematical approach that allows the properties of p-curves to be examined without simulations. This approach allows the computation of a p-curve for any statistic whose sampling distribution is known and thereby allows a thorough evaluation of its properties. For example, it shows under which conditions p-curves would exhibit the shape of a monotone decreasing function. In addition, we used weighted distribution functions to analyze how 2 different types of publication bias (i.e., cliff effects and gradual publication bias) influence the shapes of p-curves. The results of 2 survey experiments with more than 1,000 participants support the existence of a cliff effect at p = .05 and also suggest that researchers tend to be more likely to recommend submission of an article as the level of statistical significance increases beyond this p level. This gradual bias produces right-skewed p-curves mimicking the existence of real effects even when no such effects are actually present.
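The non-simulation approach advocated here can be illustrated with a short sketch (my own Python illustration for a one-sided one-sample t test; it reproduces neither the paper's derivations nor its publication-bias weighting): the expected p-curve follows directly from the noncentral t distribution once the p-value bin boundaries are converted into t values.

```python
# Sketch: analytic p-curve for a one-sided one-sample t test.
# For an assumed true effect delta and sample size n, compute the probability
# that a *significant* p value falls into each bin (.00-.01, ..., .04-.05).
import numpy as np
from scipy import stats

def p_curve(delta, n, bins=(0.0, 0.01, 0.02, 0.03, 0.04, 0.05)):
    df = n - 1
    nc = delta * np.sqrt(n)
    t_bounds = stats.t.ppf(1 - np.asarray(bins), df)   # p = 0 maps to +inf
    unconditional = stats.nct.sf(t_bounds[1:], df, nc) - stats.nct.sf(t_bounds[:-1], df, nc)
    return unconditional / stats.nct.sf(t_bounds[-1], df, nc)   # condition on p < .05

if __name__ == "__main__":
    print(p_curve(delta=0.0, n=30))   # null effect: flat p-curve (about .20 per bin)
    print(p_curve(delta=0.5, n=30))   # true effect: right-skewed p-curve
```

Publication-bias mechanisms such as the cliff effect discussed in the abstract can then be modeled by reweighting this expected curve.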
Article
Full-text available
We have empirically assessed the distribution of published effect sizes and estimated power by analyzing 26,841 statistical records from 3,801 cognitive neuroscience and psychology papers published recently. The reported median effect size was D = 0.93 (interquartile range: 0.64–1.46) for nominally statistically significant results and D = 0.24 (0.11–0.42) for nonsignificant results. Median power to detect small, medium, and large effects was 0.12, 0.44, and 0.73, reflecting no improvement through the past half-century. This is so because sample sizes have remained small. Assuming similar true effect sizes in both disciplines, power was lower in cognitive neuroscience than in psychology. Journal impact factors negatively correlated with power. Assuming a realistic range of prior probabilities for null hypotheses, false report probability is likely to exceed 50% for the whole literature. In light of our findings, the recently reported low replication success in psychology is realistic, and worse performance may be expected for cognitive neuroscience.
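The "false report probability" mentioned at the end of this abstract follows from Bayes' rule. A minimal sketch (my own illustration; α, average power, and the prior are assumed values, not the paper's estimates):

```python
# Sketch: false report probability (FRP) of a significant result, given alpha,
# average power, and the prior probability that a tested H1 is true.
def false_report_probability(alpha=0.05, power=0.44, prior_h1=0.25):
    p_sig_and_h0 = alpha * (1 - prior_h1)   # significant results from true nulls
    p_sig_and_h1 = power * prior_h1         # significant results from true effects
    return p_sig_and_h0 / (p_sig_and_h0 + p_sig_and_h1)

if __name__ == "__main__":
    # With 44% average power and only 1 in 4 tested hypotheses true,
    # roughly a quarter of significant results are false positives.
    print(false_report_probability(0.05, 0.44, 0.25))
```

With lower average power or a smaller prior probability of true hypotheses, the false report probability quickly exceeds 50%, which is the scenario described above.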
Article
Full-text available
Significance Functional MRI (fMRI) is 25 years old, yet surprisingly its most common statistical methods have not been validated using real data. Here, we used resting-state fMRI data from 499 healthy controls to conduct 3 million task group analyses. Using this null data with different experimental designs, we estimate the incidence of significant results. In theory, we should find 5% false positives (for a significance threshold of 5%), but instead we found that the most common software packages for fMRI analysis (SPM, FSL, AFNI) can result in false-positive rates of up to 70%. These results question the validity of a number of fMRI studies and may have a large impact on the interpretation of weakly significant neuroimaging results.
Article
Full-text available
Outcome knowledge influences recall of earlier predictions of the event in question. Researchers have hypothesized that age-related declines in inhibitory control may underlie older adults’ increased susceptibility to the two underlying bias processes that contribute to this hindsight bias (HB) phenomenon, recollection bias and reconstruction bias. Indeed, Coolin et al. (2015) found that older adults with lower inhibitory control were less likely to recall their earlier predictions in the presence of outcome knowledge (lower recollection ability) and were more likely to be biased by outcome knowledge when reconstructing their forgotten predictions (higher reconstruction bias) than those with higher inhibitory control. In the present study, we assess intraindividual differences in older adults’ recollection and reconstruction processes using a within-subjects manipulation of inhibition. We tested 80 older adults (Mage = 71.40, range = 65 to 87) to assess whether (a) experimentally increasing inhibition burden via outcome rehearsal during the HB task impacts the underlying HB processes, and (b) the effects of this outcome rehearsal manipulation on the underlying HB processes vary with individual differences in cognitive abilities. Our findings revealed that outcome rehearsal increased recollection bias independently of individuals’ cognitive abilities. Conversely, outcome rehearsal only increased reconstruction bias in individuals with higher inhibitory control, resulting in these individuals performing similarly to individuals with lower inhibitory control. These observations support the role of inhibitory control in older adults’ HB and suggest that even individuals with higher inhibition ability are susceptible to HB when processing resources are limited.
Article
Full-text available
There is considerable current debate about the need for replication in the science of social psychology. Most of the current discussion and approbation is centered on direct or exact replications, the attempt to conduct a study in a manner as close to the original as possible. We focus on the value of conceptual replications, the attempt to test the same theoretical process as an existing study, but that uses methods that vary in some way from the previous study. The tension between the two kinds of replication is a tension of values—exact replications value confidence in operationalizations; their requirement tends to favor the status quo. Conceptual replications value confidence in theory; their use tends to favor rapid progress over ferreting out error. We describe the many ways in which conceptual replications can be superior to direct replications. We further argue that the social system of science is quite robust to these threats and is self-correcting.
Article
Full-text available
Simonsohn, Nelson, and Simmons (2014) have suggested a novel test to detect p-hacking in research, that is, when researchers report excessive rates of "significant effects" that are truly false positives. Although this test is very useful for identifying true effects in some cases, it fails to identify false positives in several situations when researchers conduct multiple statistical tests (e.g., reporting the most significant result). In these cases, p-curves are right-skewed, thereby mimicking the existence of real effects even if no effect is actually present.
Article
Full-text available
Empirically analyzing empirical evidence. One of the central goals in any scientific endeavor is to understand causality. Experiments that seek to demonstrate a cause/effect relation most often manipulate the postulated causal factor. Aarts et al. describe the replication of 100 experiments reported in papers published in 2008 in three high-ranking psychology journals. Assessing whether the replication and the original experiment yielded the same result according to several criteria, they find that about one-third to one-half of the original findings were also observed in the replication study. Science, this issue 10.1126/science.aac4716
Article
Full-text available
Statistical power depends on the size of the effect of interest. However, effect sizes are rarely fixed in psychological research: Study design choices, such as the operationalization of the dependent variable or the treatment manipulation, the social context, the subject pool, or the time of day, typically cause systematic variation in the effect size. Ignoring this between-study variation, as standard power formulae do, results in assessments of power that are too optimistic. Consequently, when researchers attempting replication set sample sizes using these formulae, their studies will be underpowered and will thus fail at a greater than expected rate. We illustrate this with both hypothetical examples and data on several well-studied phenomena in psychology. We provide formulae that account for between-study variation and suggest that researchers set sample sizes with respect to our generally more conservative formulae. Our formulae generalize to settings in which there are multiple effects of interest. We also introduce an easy-to-use website that implements our approach to setting sample sizes. Finally, we conclude with recommendations for quantifying between-study variation. © The Author(s) 2014.
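The core idea can be sketched numerically (my own Python illustration using a normal approximation, not the authors' formulae or website): when the effect size varies across studies, the expected power of a replication is the standard power function averaged over that distribution, which is lower than the power computed at the mean effect size alone.

```python
# Sketch: expected power of a two-sample test when the true effect size varies
# across studies (normal approximation; illustrative numbers only).
import numpy as np
from scipy import stats

def power_fixed(d, n_per_group, alpha=0.05):
    """Standard power for a two-sided two-sample test, normal approximation."""
    se = np.sqrt(2.0 / n_per_group)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    return stats.norm.sf(z_crit - d / se) + stats.norm.cdf(-z_crit - d / se)

def power_heterogeneous(d_mean, tau, n_per_group, alpha=0.05, grid=2001):
    """Average power over d ~ Normal(d_mean, tau^2)."""
    d = np.linspace(d_mean - 5 * tau, d_mean + 5 * tau, grid)
    w = stats.norm.pdf(d, d_mean, tau)
    w /= w.sum()
    return float(np.sum(w * power_fixed(d, n_per_group, alpha)))

if __name__ == "__main__":
    n = 64   # per group, roughly 80% power for d = 0.5 in the fixed-effect case
    print(power_fixed(0.5, n))                 # close to .80
    print(power_heterogeneous(0.5, 0.2, n))    # noticeably lower than .80
```

Planning the replication sample size against the heterogeneous power function, rather than the fixed one, yields the more conservative sample sizes the abstract recommends.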
Article
Full-text available
One of the most important issues in structural equation modeling concerns testing model fit. We propose to retain the likelihood ratio test in combination with decision criteria that increase with sample size. Specifically, rooted in Neyman-Pearson hypothesis testing, we advocate balancing α- and β-error risks. This strategy has a number of desirable consequences and addresses several objections that have been raised against the likelihood ratio test in model evaluation. First, balancing error risks avoids logical problems with Fisher-type hypotheses tests when predicting the null hypothesis (i.e., model fit). Second, both types of statistical decision errors are controlled. Third, larger samples are encouraged (rather than penalized) because both error risks diminish as the sample size increases. Finally, the strategy addresses the concern that structural equation models cannot necessarily be expected to provide an exact description of real-world phenomena.
Article
Full-text available
An essential first step in planning a confirmatory or a replication study is to determine the sample size necessary to draw statistically reliable inferences using power analysis. A key problem, however, is that what is available is only the sample estimate of the effect size, and its use can lead to severely underpowered studies when the effect size is overestimated. As a potential remedy, we introduce safeguard power analysis, which uses the uncertainty in the estimate of the effect size to achieve a better likelihood of correctly identifying the population effect size. Using a lower-bound estimate of the effect size, in turn, allows researchers to calculate a sample size for a replication study that helps protect it from being underpowered. We show that in most common instances, compared with nominal power, safeguard power is higher whereas standard power is lower. We additionally recommend the use of safeguard power analysis to evaluate the strength of the evidence provided by the original study. © The Author(s) 2014.
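A minimal sketch of the safeguard idea (my own Python illustration with a normal approximation for Cohen's d; it is not the authors' implementation, and the numbers are invented): take the lower bound of, say, a 60% confidence interval around the published effect size and plan the replication for that value rather than for the point estimate.

```python
# Sketch of safeguard power analysis (normal approximations; illustrative only):
# plan the replication for the lower CI bound of the published effect size.
import math
from scipy import stats

def safeguard_d(d_obs, n1, n2, ci=0.60):
    """Lower bound of a two-sided CI for Cohen's d (normal approximation)."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d_obs ** 2 / (2 * (n1 + n2)))
    z = stats.norm.ppf(1 - (1 - ci) / 2)
    return d_obs - z * se

def n_per_group(d, power=0.80, alpha=0.05):
    """Per-group n for a two-sided two-sample test (normal approximation)."""
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(power)
    return math.ceil(2 * ((z_a + z_b) / d) ** 2)

if __name__ == "__main__":
    d_obs = 0.50                      # published effect from a study with n1 = n2 = 30
    d_safe = safeguard_d(d_obs, 30, 30, ci=0.60)
    print(d_safe)                     # smaller than 0.50
    print(n_per_group(d_obs))         # standard plan based on the point estimate
    print(n_per_group(d_safe))        # safeguard plan: considerably larger n
```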
Article
Full-text available
There has been increasing criticism of the way psychologists conduct and analyze studies. These critiques as well as failures to replicate several high-profile studies have been used as justification to proclaim a "replication crisis" in psychology. Psychologists are encouraged to conduct more "exact" replications of published studies to assess the reproducibility of psychological research. This article argues that the alleged "crisis of replicability" is primarily due to an epistemological misunderstanding that emphasizes the phenomenon instead of its underlying mechanisms. As a consequence, a replicated phenomenon may not serve as a rigorous test of a theoretical hypothesis because identical operationalizations of variables in studies conducted at different times and with different subject populations might test different theoretical constructs. Therefore, we propose that for meaningful replications, attempts at reinstating the original circumstances are not sufficient. Instead, replicators must ascertain that conditions are realized that reflect the theoretical variable(s) manipulated (and/or measured) in the original study. © The Author(s) 2013.
Article
Full-text available
The published journal article is the primary means of communicating scientific ideas, methods, and empirical data. Not all ideas and data get published. In the present scientific culture, novel and positive results are considered more publishable than replications and negative results. This creates incentives to avoid or ignore replications and negative results, even at the expense of accuracy (Giner-Sorolla, 2012; Nosek, Spies, & Motyl, 2012). As a consequence, replications (Makel, Plucker, & Hegarty, 2012) and negative results (Fanelli, 2010; Sterling, 1959) are rare in the published literature. This insight is not new, but the culture is resistant to change. This article introduces the first known journal issue in any discipline consisting exclusively of preregistered replication studies. It demonstrates that replications have substantial value, and that incentives can be changed. (PsycINFO Database Record (c) 2014 APA, all rights reserved)
Article
Full-text available
In this article we give an elementary introduction to optimal design for some basic statistical models. Research questions coming from a project on dyscalculia are used as illustrative examples throughout this article. First, basic design issues are considered for the t-test and simple regression and then extended to the general linear model. Finally, we will outline optimal designs for nonlinear statistical models. As a simple example, logistic regression with a binary explanatory variable is used. (PsycINFO Database Record (c) 2014 APA, all rights reserved)
Article
Full-text available
Recent controversies have questioned the quality of scientific practice in the field of psychology, but these concerns are often based on anecdotes and seemingly isolated cases. To gain a broader perspective, this article applies an objective test for excess success to a large set of articles published in the journal Psychological Science between 2009 and 2012. When empirical studies succeed at a rate much higher than is appropriate for the estimated effects and sample sizes, readers should suspect that unsuccessful findings have been suppressed, the experiments or analyses were improper, or the theory does not properly account for the data. In total, problems appeared for 82 % (36 out of 44) of the articles in Psychological Science that had four or more experiments and could be analyzed.
Article
Full-text available
Psychological scientists have recently started to reconsider the importance of close replications in building a cumulative knowledge base; however, there is no consensus about what constitutes a convincing close replication study. To facilitate convincing close replication attempts we have developed a Replication Recipe, outlining standard criteria for a convincing close replication. Our Replication Recipe can be used by researchers, teachers, and students to conduct meaningful replication studies and integrate replications into their scholarly habits.
Article
Full-text available
Because scientists tend to report only studies (publication bias) or analyses (p-hacking) that “work,” readers must ask, “Are these effects true, or do they merely reflect selective reporting?” We introduce p-curve as a way to answer this question. P-curve is the distribution of statistically significant p values for a set of studies (ps < .05). Because only true effects are expected to generate right-skewed p-curves—containing more low (.01s) than high (.04s) significant p values—only right-skewed p-curves are diagnostic of evidential value. By telling us whether we can rule out selective reporting as the sole explanation for a set of findings, p-curve offers a solution to the age-old inferential problems caused by file-drawers of failed studies and analyses.
Article
Full-text available
The present article suggests a possible way to reduce the file drawer problem in scientific research (Rosenthal, 1978, 1979), that is, the tendency for “nonsignificant” results to remain hidden in scientists’ file drawers because both authors and journals strongly prefer statistically significant results. We argue that peer-reviewed journals based on the principle of rigorous evaluation of research proposals before results are known would address this problem successfully. Even a single journal adopting a result-blind evaluation policy would remedy the persisting problem of publication bias more efficiently than other tools and techniques suggested so far. We also propose an ideal editorial policy for such a journal and discuss pragmatic implications and potential problems associated with this policy. Moreover, we argue that such a journal would be a valuable addition to the scientific publication outlets, because it supports a scientific culture encouraging the publication of well-designed and technically sound empirical research irrespective of the results obtained. Finally, we argue that such a journal would be attractive for scientists, publishers, and research agencies.
Article
Full-text available
The veracity of substantive research claims hinges on the way experimental data are collected and analyzed. In this article, we discuss an uncomfortable fact that threatens the core of psychology’s academic enterprise: almost without exception, psychologists do not commit themselves to a method of data analysis before they see the actual data. It then becomes tempting to fine tune the analysis to the data in order to obtain a desired result—a procedure that invalidates the interpretation of the common statistical tests. The extent of the fine tuning varies widely across experiments and experimenters but is almost impossible for reviewers and readers to gauge. To remedy the situation, we propose that researchers preregister their studies and indicate in advance the analyses they intend to conduct. Only these analyses deserve the label “confirmatory,” and only for these analyses are the common statistical tests valid. Other analyses can be carried out but these should be labeled “exploratory.” We illustrate our proposal with a confirmatory replication attempt of a study on extrasensory perception.
Article
Full-text available
Two experiments (modeled after J. Deese's 1959 study) revealed remarkable levels of false recall and false recognition in a list learning paradigm. In Exp 1, Ss studied lists of 12 words (e.g., bed, rest, awake); each list was composed of associates of 1 nonpresented word (e.g., sleep). On immediate free recall tests, the nonpresented associates were recalled 40% of the time and were later recognized with high confidence. In Exp 2, a false recall rate of 55% was obtained with an expanded set of lists, and on a later recognition test, Ss produced false alarms to these items at a rate comparable to the hit rate. The act of recall enhanced later remembering of both studied and nonstudied material. The results reveal a powerful illusion of memory: People remember events that never happened. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Full-text available
For any given research area, one cannot tell how many studies have been conducted but never reported. The extreme view of the "file drawer problem" is that journals are filled with the 5% of the studies that show Type I errors, while the file drawers are filled with the 95% of the studies that show nonsignificant results. Quantitative procedures for computing the tolerance for filed and future null results are reported and illustrated, and the implications are discussed. (15 ref) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
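Rosenthal's tolerance computation can be written down directly. A small sketch (my own illustration using the classic Stouffer-based fail-safe N; the z values are invented):

```python
# Sketch: Rosenthal's fail-safe N ("file drawer" tolerance) via Stouffer's method.
# Number of unpublished null-result studies (mean z = 0) needed to render the
# combined result nonsignificant. Example z values are hypothetical.
import math
from scipy import stats

def fail_safe_n(z_values, alpha=0.05):
    k = len(z_values)
    z_crit = stats.norm.ppf(1 - alpha)       # one-tailed criterion, as in Rosenthal (1979)
    return (sum(z_values) ** 2) / z_crit ** 2 - k

if __name__ == "__main__":
    z = [2.1, 1.8, 2.5, 1.6, 2.0]            # hypothetical published studies
    print(math.floor(fail_safe_n(z)))        # null studies the file drawer would need to hold
```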
Article
Full-text available
Maintains that many psychological investigations are accused of failure to generalize to the real world because of sample bias or artificiality of setting. It is argued in this article that such generalizations often are not intended. Rather than making predictions about the real world from the laboratory, it is possible to test predictions that specify what ought to happen in the lab. Even "artificial" findings may be regarded as interesting because they show what can occur, even if it rarely does; or, where generalizations are made, they may have added force because of artificiality of sample or setting. A misplaced preoccupation with external validity can lead to dismissing good research for which generalization to real life is not intended or meaningful. (18 ref) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Full-text available
Cohen (1962) pointed out the importance of statistical power for psychology as a science, but the statistical power of studies has not increased, while the number of studies in a single article has increased. It has been overlooked that multiple studies with modest power have a high probability of producing nonsignificant results because power decreases as a function of the number of statistical tests that are being conducted (Maxwell, 2004). The discrepancy between the expected number of significant results and the actual number of significant results in multiple-study articles undermines the credibility of the reported results, and it is likely that questionable research practices have contributed to the reporting of too many significant results (Sterling, 1959). The problem of low power in multiple-study articles is illustrated using Bem's (2011) article on extrasensory perception and Gailliot et al.'s (2007) article on glucose and self-regulation. I conclude with several recommendations that can increase the credibility of scientific evidence in psychological journals. One major recommendation is to pay more attention to the power of studies to produce positive results without the help of questionable research practices and to request that authors justify sample sizes with a priori predictions of effect sizes. It is also important to publish replication studies with nonsignificant results if these studies have high power to replicate a published finding. (PsycINFO Database Record (c) 2012 APA, all rights reserved).
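The core arithmetic behind this argument is simple: if k independent studies each have true power p, the probability that all of them yield significant results is p^k. A back-of-the-envelope illustration with hypothetical values:

    # Probability that ALL studies in a multiple-study article reach p < .05,
    # assuming independent studies that each have the stated true power.
    for power in (0.5, 0.8, 0.95):
        for k in (3, 5, 10):
            print(f"power = {power:.2f}, k = {k:2d}: P(all significant) = {power ** k:.3f}")

Even at 80% power per study, ten out of ten significant results are expected only about 11% of the time.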
Preprint
Solid science, theoretical advances, teaching opportunities: Rejoice! This paper was published in European Journal of Personality: Ijzerman, H., Brandt, M. J., & Van Wolferen, J. (2013). Rejoice! In replication. European Journal of Personality, 27(2), 128-129.
Article
The sample size necessary to obtain a desired level of statistical power depends in part on the population value of the effect size, which is, by definition, unknown. A common approach to sample-size planning uses the sample effect size from a prior study as an estimate of the population value of the effect to be detected in the future study. Although this strategy is intuitively appealing, effect-size estimates, taken at face value, are typically not accurate estimates of the population effect size because of publication bias and uncertainty. We show that the use of this approach often results in underpowered studies, sometimes to an alarming degree. We present an alternative approach that adjusts sample effect sizes for bias and uncertainty, and we demonstrate its effectiveness for several experimental designs. Furthermore, we discuss an open-source R package, BUCSS, and user-friendly Web applications that we have made available to researchers so that they can easily implement our suggested methods.
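The BUCSS package itself is implemented in R; the following Python sketch only illustrates the underlying problem with hypothetical numbers, assuming a two-group t-test and a published effect size that is inflated relative to the population value.

    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    published_d = 0.50   # hypothetical sample effect size, taken at face value
    true_d = 0.30        # hypothetical population effect after bias and uncertainty

    # Plan the sample for 80% power using the published (inflated) effect ...
    n_planned = analysis.solve_power(effect_size=published_d, power=0.80, alpha=0.05)
    # ... then ask what power that sample actually buys against the true effect.
    actual_power = analysis.solve_power(effect_size=true_d, nobs1=n_planned, alpha=0.05)

    print(f"planned n per group: {n_planned:.0f}")                 # about 64
    print(f"power if the true d is {true_d}: {actual_power:.2f}")  # roughly .40

In this hypothetical case, a study planned for 80% power ends up with about 40% power, which is the kind of shortfall the article warns about.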
Article
Null hypothesis significance testing (NHST) has been the subject of debate for decades and alternative approaches to data analysis have been proposed. This article addresses this debate from the perspective of scientific inquiry and inference. Inference is an inverse problem and application of statistical methods cannot reveal whether effects exist or whether they are empirically meaningful. Hence, drawing conclusions from the outcomes of statistical analyses is subject to limitations. NHST has been criticized for its misuse and the misconstruction of its outcomes, also stressing its inability to meet expectations that it was never designed to fulfil. Ironically, alternatives to NHST are identical in these respects, something that has been overlooked in their presentation. Three of those alternatives are discussed here (estimation via confidence intervals and effect sizes, quantification of evidence via Bayes factors, and mere reporting of descriptive statistics). None of them offers a solution to the problems that NHST is purported to have, all of them are susceptible to misuse and misinterpretation, and some bring their own problems (e.g., Bayes factors have a one-to-one correspondence with p values, but they are entirely deprived of an inferential framework). Those alternatives also fail to cover a broad area of inference not involving distributional parameters, where NHST procedures remain the only (and suitable) option. Like knives or axes, NHST is not inherently evil; only misuse and misinterpretation of its outcomes need to be eradicated.
Article
In this article, we present a model for determining how total research payoff depends on researchers’ choices of sample sizes, α levels, and other parameters of the research process. The model can be used to quantify various trade-offs inherent in the research process and thus to balance competing goals, such as (a) maximizing both the number of studies carried out and also the statistical power of each study, (b) minimizing the rates of both false positive and false negative findings, and (c) maximizing both replicability and research efficiency. Given certain necessary information about a research area, the model can be used to determine the optimal values of sample size, statistical power, rate of false positives, rate of false negatives, and replicability, such that overall research payoff is maximized. More specifically, the model shows how the optimal values of these quantities depend upon the size and frequency of true effects within the area, as well as the individual payoffs associated with particular study outcomes. The model is particularly relevant within current discussions of how to optimize the productivity of scientific research, because it shows which aspects of a research area must be considered and how these aspects combine to determine total research payoff.
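The payoff model itself is not reproduced in this abstract; the following toy calculation, with entirely hypothetical numbers, only illustrates one trade-off it formalizes: splitting a fixed participant budget across more studies lowers the power of each study.

    from statsmodels.stats.power import TTestIndPower

    budget, true_d = 800, 0.4   # hypothetical participant budget and effect size
    for n_studies in (2, 5, 10, 20):
        n_per_group = budget / n_studies / 2
        power = TTestIndPower().solve_power(effect_size=true_d, nobs1=n_per_group, alpha=0.05)
        print(f"{n_studies:2d} studies of N = {2 * n_per_group:.0f}: power per study = {power:.2f}")

Whether more small studies or fewer large ones maximize total payoff then depends on the base rate of true effects and the costs of false positives and false negatives, which is precisely what the model weighs.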
Article
Abstract. Recently, evidence has been accumulating that false-positive findings are reported in scientific publications at an elevated rate, so that the research literature reflects a distorted picture of reality. The Fachkollegium Psychologie (psychology review board) of the Deutsche Forschungsgemeinschaft has taken up this problem and discussed the possible causes of false-positive findings. This article summarizes that discussion and asks applicants to pay closer attention to this issue in their research proposals. We also appeal to applicants, reviewers, and editors to give greater weight to negative findings and to replications in research proposals and scientific work, including clinical trials.
Article
Simonsohn (2015) proposed to use effect sizes of high-powered replications to evaluate whether lower-powered original studies could have obtained the reported effect. His focus on sample size misses that effect size comparisons are informative with regard to a theoretical question only when the replications (i) successfully realize the theoretical variable of interest, which (ii) usually requires supporting evidence from a manipulation check that should (iii) also indicate that the manipulations were of comparable strength. Because psychological phenomena are context sensitive, (iv) the context of data collection should be similar and (v) the measurement procedures comparable across studies. (vi) Larger samples are often more diverse in terms of demographics and individual differences, which can further affect effect size estimates. Without attention to these points, high-powered replications do not allow inferences about whether lower-powered original studies could observe what they reported.
Article
Psychological science, the field, continues to struggle with the challenge of establishing interesting, important, and replicable phenomena. As I often tell my students, “If scientific psychology was easy, everyone would do it.” We can take some comfort in knowing that other sciences, too, face similar challenges (e.g., Begley & Ellis, 2012). But our business is with psychology. In August of this year, Science published a fascinating article by Brian Nosek and 269 coauthors (Open Science Collaboration, 2015). They reported direct replication attempts of 100 experiments published in prestigious psychology journals in 2008, including experiments reported in 39 articles in Psychological Science. Although I expect there is room to critique some of the replications, the article strikes me as a terrific piece of work, and I recommend reading it (and giving it to students). For each experiment, researchers prespecified a benchmark finding. On average, the replications had statistical power of .90+ to detect effects of the sizes obtained in the original studies, but fewer than half of them yielded a statistically significant effect. As Nosek and his coauthors made clear, even ideal replications of ideal studies are expected to fail some of the time (Francis, 2012), and a failure to replicate a previously observed effect can arise from differences between the original and replication studies and hence does not necessarily indicate flaws in the original study (Maxwell, Lau, & Howard, 2015; Stroebe & Strack, 2014). Still, it seems likely that psychology journals have too often reported spurious effects arising from Type I errors (e.g., Francis, 2014).
Article
Chapter outline: Introduction; Individual studies; The summary effect; Heterogeneity of effect sizes; Summary points.
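Since only the chapter headings survive here, a minimal fixed-effect sketch of the two central quantities, the inverse-variance weighted summary effect and the Q/I² heterogeneity statistics, may help; the study-level effects and variances below are hypothetical.

    import numpy as np

    effects = np.array([0.30, 0.45, 0.10, 0.55])     # hypothetical study effects
    variances = np.array([0.02, 0.03, 0.05, 0.04])   # their sampling variances

    w = 1.0 / variances                                # inverse-variance weights
    summary = np.sum(w * effects) / np.sum(w)          # fixed-effect summary effect
    se = np.sqrt(1.0 / np.sum(w))                      # its standard error
    q = np.sum(w * (effects - summary) ** 2)           # Cochran's Q
    i2 = max(0.0, (q - (len(effects) - 1)) / q) * 100  # I^2 heterogeneity, in percent

    print(f"summary = {summary:.3f} (SE {se:.3f}), Q = {q:.2f}, I^2 = {i2:.1f}%")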
Article
The increasing cost of research means that scientists are in more urgent need of optimal design theory to increase the efficiency of parameter estimators and the statistical power of their tests. The objectives of a good design are to provide interpretable and accurate inference at minimal costs. Optimal design theory can help to identify a design with maximum power and maximum information for a statistical model and, at the same time, enable researchers to check on the model assumptions. This book introduces optimal experimental design in an accessible format; provides guidelines for practitioners to increase the efficiency of their designs and demonstrates how optimal designs can reduce a study's costs; discusses the merits of optimal designs and compares them with commonly used designs; takes the reader from simple linear regression models to advanced designs for multiple linear regression and nonlinear models in a systematic manner; and illustrates design techniques with practical examples from social and biomedical research to enhance the reader's understanding. Researchers and students studying social, behavioural and biomedical sciences will find this book useful for understanding design issues and in putting optimal design ideas to practice.
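A minimal numerical illustration of the basic idea, assuming the simplest possible design question: for a two-group comparison with equal error variances and a fixed total N, the variance of the estimated group difference, sigma^2 * (1/n1 + 1/n2), is smallest under equal allocation, so any other split wastes information for the same cost.

    # Variance of the estimated group difference for different allocations of N = 100
    sigma2, total_n = 1.0, 100
    for n1 in (10, 25, 50, 75, 90):
        n2 = total_n - n1
        print(f"n1 = {n1:2d}, n2 = {n2:2d}: Var(difference) = {sigma2 * (1/n1 + 1/n2):.3f}")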
Article
Reproducibility is a defining feature of science. However, because of strong incentives for innovation and weak incentives for confirmation, direct replication is rarely practiced or published. The Reproducibility Project is an open, large-scale, collaborative effort to systematically examine the rate and predictors of reproducibility in psychological science. So far, 72 volunteer researchers from 41 institutions have organized to openly and transparently replicate studies published in three prominent psychological journals in 2008. Multiple methods will be used to evaluate the findings, calculate an empirical rate of replication, and investigate factors that predict reproducibility. Whatever the result, a better understanding of reproducibility will ultimately improve confidence in scientific methodology and findings.
Article
Several influential publications have sensitized the community of behavioral scientists to the dangers of inflated effects and false-positive errors leading to the unwarranted publication of nonreplicable findings. This issue has been related to prominent cases of data fabrication and survey results pointing to bad practices in empirical science. Although we concur with the motives behind these critical arguments, we note that an isolated debate of false positives may itself be misleading and counter-productive. Instead, we argue that, given the current state of affairs in behavioral science, false negatives often constitute a more serious problem. Referring to Wason's (1960) seminal work on inductive reasoning, we show that the failure to assertively generate and test alternative hypotheses can lead to dramatic theoretical mistakes, which cannot be corrected by any kind of rigor applied to statistical tests of the focal hypotheses. We conclude that a scientific culture rewarding strong inference (Platt, 1964) is more likely to see progress than a culture preoccupied with tightening its standards for the mere publication of original findings. © The Author(s) 2012.
Article
We discuss three arguments voiced by scientists who view the current outpouring of concern about replicability as overblown. The first idea is that the adoption of a low alpha level (e.g., 5%) puts reasonable bounds on the rate at which errors can enter the published literature, making false-positive effects rare enough to be considered a minor issue. This, we point out, rests on statistical misunderstanding: The alpha level imposes no limit on the rate at which errors may arise in the literature (Ioannidis, 2005b). Second, some argue that whereas direct replication attempts are uncommon, conceptual replication attempts are common-providing an even better test of the validity of a phenomenon. We contend that performing conceptual rather than direct replication attempts interacts insidiously with publication bias, opening the door to literatures that appear to confirm the reality of phenomena that in fact do not exist. Finally, we discuss the argument that errors will eventually be pruned out of the literature if the field would just show a bit of patience. We contend that there are no plausible concrete scenarios to back up such forecasts and that what is needed is not patience, but rather systematic reforms in scientific practice. © The Author(s) 2012.
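The first point can be made concrete with a back-of-the-envelope calculation, assuming hypothetical values for the base rate of true effects and for average power: among significant published findings, the share that are false positives is (1 - prior) * alpha / [(1 - prior) * alpha + prior * power], which is not bounded by alpha.

    alpha, power = 0.05, 0.50
    for prior in (0.50, 0.25, 0.10):   # hypothetical base rates of true effects
        false_pos = (1 - prior) * alpha
        true_pos = prior * power
        print(f"prior = {prior:.2f}: P(no real effect | significant) = "
              f"{false_pos / (false_pos + true_pos):.2f}")

With a 10% base rate of true effects and 50% power, nearly half of the significant results are false positives despite alpha = .05.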
Article
Replicability of findings is at the heart of any empirical science. The aim of this article is to move the current replicability debate in psychology towards concrete recommendations for improvement. We focus on research practices but also offer guidelines for reviewers, editors, journal management, teachers, granting institutions, and university promotion committees, highlighting some of the emerging and existing practical solutions that can facilitate implementation of these recommendations. The challenges for improving replicability in psychological science are systemic. Improvement can occur only if changes are made at many levels of practice, evaluation, and reward.
Article
Functional magnetic resonance imaging (fMRI) studies of emotion, personality, and social cognition have drawn much attention in recent years, with high-profile studies frequently reporting extremely high (e.g., >.8) correlations between brain activation and personality measures. We show that these correlations are higher than should be expected given the (evidently limited) reliability of both fMRI and personality measures. The high correlations are all the more puzzling because method sections rarely contain much detail about how the correlations were obtained. We surveyed authors of 55 articles that reported findings of this kind to determine a few details on how these correlations were computed. More than half acknowledged using a strategy that computes separate correlations for individual voxels and reports means of only those voxels exceeding chosen thresholds. We show how this nonindependent analysis inflates correlations while yielding reassuring-looking scattergrams. This analysis technique was used to obtain the vast majority of the implausibly high correlations in our survey sample. In addition, we argue that, in some cases, other analysis problems likely created entirely spurious correlations. We outline how the data from these studies could be reanalyzed with unbiased methods to provide accurate estimates of the correlations in question and urge authors to perform such reanalyses. The underlying problems described here appear to be common in fMRI research of many kinds, not just in studies of emotion, personality, and social cognition. © 2009 Association for Psychological Science.
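The inflation produced by selecting voxels on the very correlation that is then reported can be demonstrated with a small simulation on pure noise; all parameters below are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)
    n_subjects, n_voxels, r_threshold = 20, 5000, 0.4

    behavior = rng.standard_normal(n_subjects)
    voxels = rng.standard_normal((n_voxels, n_subjects))   # pure noise: true r = 0

    # Correlate every voxel with the behavioral measure ...
    r = np.array([np.corrcoef(v, behavior)[0, 1] for v in voxels])
    # ... then report only the mean correlation of voxels that pass the threshold.
    selected = r[r > r_threshold]
    print(f"{selected.size} voxels selected; mean reported r = {selected.mean():.2f}")

Although every true correlation is zero, the mean of the selected voxels comes out well above the threshold, which is exactly the nonindependence problem the article describes.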
Article
A study with low statistical power has a reduced chance of detecting a true effect, but it is less well appreciated that low power also reduces the likelihood that a statistically significant result reflects a true effect. Here, we show that the average statistical power of studies in the neurosciences is very low. The consequences of this include overestimates of effect size and low reproducibility of results. There are also ethical dimensions to this problem, as unreliable research is inefficient and wasteful. Improving reproducibility in neuroscience is a key priority and requires attention to well-established but often ignored methodological principles.
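The overestimation of effect sizes under low power can be checked directly by simulation: condition on significance and the surviving estimates are inflated. A sketch with hypothetical parameters:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    true_d, n, n_sims = 0.2, 20, 10_000        # small true effect, low power
    d_obs, sig = [], []
    for _ in range(n_sims):
        a = rng.normal(true_d, 1.0, n)
        b = rng.normal(0.0, 1.0, n)
        t, p = stats.ttest_ind(a, b)
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        d_obs.append((a.mean() - b.mean()) / pooled_sd)
        sig.append(p < 0.05 and t > 0)
    d_obs, sig = np.array(d_obs), np.array(sig)

    print(f"power is roughly {sig.mean():.2f}")
    print(f"mean d among significant results: {d_obs[sig].mean():.2f} (true d = {true_d})")

With these hypothetical settings, only the runs that happen to overestimate the effect reach significance, so the published record would suggest an effect several times larger than the truth.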
Article
A recent set of articles in Perspectives on Psychological Science discussed inflated correlations between brain measures and behavioral criteria when measurement points (voxels) are deliberately selected to maximize criterion correlations (the target article was Vul, Harris, Winkielman, & Pashler, 2009). However, closer inspection reveals that this problem is only a special symptom of a broader methodological problem that characterizes all paradigmatic research, not just neuroscience. Researchers not only select voxels to inflate effect size, they also select stimuli, task settings, favorable boundary conditions, dependent variables and independent variables, treatment levels, moderators, mediators, and multiple parameter settings in such a way that empirical phenomena become maximally visible and stable. In general, paradigms can be understood as conventional setups for producing idealized, inflated effects. Although the feasibility of representative designs is restricted, a viable remedy lies in a reorientation of paradigmatic research from the visibility of strong effect sizes to genuine validity and scientific scrutiny. © The Author(s) 2011.
Article
The ideas of this book have grown from the soil of a philosophic movement which, though confined to small groups, is spread over the whole world. American pragmatists and behaviorists, English logistic epistemologists, Austrian positivists, German representatives of the analysis of science, and Polish logisticians are the main groups to which is due the origin of that philosophic movement which we now call "logistic empiricism." It is the intention of uniting both the empiricist conception of modern science and the formalistic conception of logic, such as expressed in logistic, which marks the working program of this philosophic movement. Since this book is written with the same intentions, it may be asked how such a new attempt at a foundation of logistic empiricism can be justified. Many things indeed will be found in this book which have been said before by others, such as the physicalistic conception of language and the importance attributed to linguistic analysis, the connection of meaning and verifiability, and the behavioristic conception of psychology. This fact may in part be justified by the intention of giving a report of those results which may be considered today as a secured possession of the philosophic movement described; however, this is not the sole intention. If the present book enters once more into the discussion of these fundamental problems, it is because former investigations did not sufficiently take into account one concept which penetrates into all the logical relations constructed in these domains: that is, the concept of probability. It is the intention of this book to show the fundamental place which is occupied in the system of knowledge by this concept and to point out the consequences involved in a consideration of the probability character of knowledge. It is this combination of the results of my investigations on probability with the ideas of an empiricist and logistic conception of knowledge which I here present as my contribution to the discussion of logistic empiricism. The growth of this movement seems to me sufficiently advanced to enter upon a level of higher approximation; and what I propose is that the form of this new phase should be a probabilistic empiricism. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Contends that when analyzing qualitative data by means of log-linear or logit-linear models, the acceptance of a given model cannot be justified statistically if only the so-called alpha error is controlled. A method for controlling the beta error in the case of chi-square goodness of fit tests is given. The problem of cumulating error probabilities in multiple tests of hierarchical log-linear models is discussed, and strategies for adjusting for probabilities of error are recommended. (English abstract) (36 ref) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
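One way to carry out such a beta-error calculation for a chi-square goodness-of-fit test is to fix the desired power (1 - beta) in advance and solve for the sample size; the sketch below uses statsmodels and a hypothetical effect size of w = 0.3 in a four-cell design, which is not necessarily the procedure proposed in the article.

    from statsmodels.stats.power import GofChisquarePower

    # Sample size so that a 4-cell goodness-of-fit test (df = 3) detects a
    # deviation of size w = 0.3 with alpha = .05 and beta = .05 (power = .95).
    n = GofChisquarePower().solve_power(effect_size=0.3, alpha=0.05, power=0.95, n_bins=4)
    print(f"required N is about {n:.0f}")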
Article
Replication of empirical findings plays a fundamental role in science. Among experimental psychologists, successful replication enhances belief in a finding, while a failure to replicate is often interpreted to mean that one of the experiments is flawed. This view is wrong. Because experimental psychology uses statistics, empirical findings should appear with predictable probabilities. In a misguided effort to demonstrate successful replication of empirical findings and avoid failures to replicate, experimental psychologists sometimes report too many positive results. Rather than strengthen confidence in an effect, too much successful replication actually indicates publication bias, which invalidates entire sets of experimental findings. Researchers cannot judge the validity of a set of biased experiments because the experiment set may consist entirely of type I errors. This article shows how an investigation of the effect sizes from reported experiments can test for publication bias by looking for too much successful replication. Simulated experiments demonstrate that the publication bias test is able to discriminate biased experiment sets from unbiased experiment sets, but it is conservative about reporting bias. The test is then applied to several studies of prominent phenomena that highlight how publication bias contaminates some findings in experimental psychology. Additional simulated experiments demonstrate that using Bayesian methods of data analysis can reduce (and in some cases, eliminate) the occurrence of publication bias. Such methods should be part of a systematic process to remove publication bias from experimental psychology and reinstate the important role of replication as a final arbiter of scientific findings.
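A minimal sketch of the logic of such a check for "too much success", assuming two-group t-tests and purely hypothetical effect sizes and sample sizes: estimate the power of each reported experiment from its own data and multiply; a very small product despite an all-significant set of studies signals excess success and hence possible publication bias.

    import numpy as np
    from statsmodels.stats.power import TTestIndPower

    d_obs = np.array([0.45, 0.50, 0.40, 0.55])    # reported effect sizes (hypothetical)
    n_per_group = np.array([25, 30, 28, 22])      # per-group sample sizes (hypothetical)

    power = np.array([TTestIndPower().solve_power(effect_size=d, nobs1=n, alpha=0.05)
                      for d, n in zip(d_obs, n_per_group)])
    print(f"estimated powers: {np.round(power, 2)}")
    print(f"P(all {len(d_obs)} experiments significant): {power.prod():.3f}")

Here four "successful" experiments with estimated powers around .3 to .5 jointly have only a few percent chance of all succeeding, which is the pattern the proposed test flags.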
Article
That an experimenter can very easily influence his subjects to give him the response he wants is a problem that every investigator recognizes and takes precautions to avoid. But how does one cope with the problem of unconscious influence? It is possible that a good many contradictory or unexpected findings are due to the fact that the experimenter unknowingly communicated his desires or expectations to his subjects. Though this problem has been generally recognized and much discussed, there has heretofore been no systematic test of the hypothesis that an experimenter can obtain from his subjects the data he expects or wants to obtain. This paper reports just such a test, and discusses what can be done to avoid experimenter influence, both conscious and unconscious.