Article

Statistical significance and scientific misconduct: improving the style of the published research paper


Abstract

A science, business, or law that is basing its validity on the level of p-values, t statistics and other tests of statistical significance is looking less and less relevant and more and more unethical. Today’s economist uses a lot of wit putting a clever index of opportunity cost into his models; but then, like the amnesiac, he fails to see opportunity cost in statistical estimates he makes of those same models. Medicine, psychology, pharmacology and other fields are similarly damaged by this fundamental error of science, keeping bad treatments on the market and good ones out. A few small changes to the style of the published research paper using statistical methods can bring large beneficial effects to more than academic research papers. It is suggested that misuse of statistical significance be added to the definition of scientific misconduct currently enforced by the NIH, NSF, Office of Research Integrity and others.


... Allegedly this situation has prompted a 'file drawer problem', where researchers' filing cabinets are believed to be chock-full of insignificant findings (Simonsohn et al., 2014); consequently, failure to report non-supported hypotheses may lead others to continue their efforts to test the very same hypotheses in subsequent research (Bedeian, Taylor, & Miller, 2010; Garud, 2015; Greenwald, 1975). In the end, false positives may emerge from the continuous efforts to prove the same hypotheses, thereby causing a misrepresentation of reality, as reality would predominantly consist of significant findings (Ioannidis, 2008). Statistical significance is not always a prerequisite, nor is it ever sufficient, for establishing research findings as scientifically valuable or meaningful (Gigerenzer & Marewski, 2015; Lindsay, 1994; Ziliak, 2016). This is because statistical significance is not equivalent to scientific, human, or economic significance (Wasserstein & Lazar, 2016); the basis for publication should therefore not be statistical significance but rather the scientific significance of a finding. ...
... In an effort to oppose the dominance of NHST, the editors of Basic and Applied Social Psychology decided to ban p-values (Wasserstein & Lazar, 2016), and this is not the first time in history that a scientific journal has tried banning the p-value. The New England Journal of Medicine (in the 1970s), Epidemiology and the American Journal of Public Health (in the 1990s), and the Publication Manual of the American Psychological Association have all experimented with bans (Ziliak, 2016). The common thread of these claims and bans is that we, as researchers, must be able to argue for the scientific significance of our findings (Ziliak & McCloskey, 2004a, 2004b), instead of indulging in what Gigerenzer and Marewski (2015) describe as mindless statistical inference. ...
... "Sign econometrics" is about stating the direction of the coefficient but not its size (McCloskey & Ziliak, 1996;Ziliak & McCloskey, 2004b). However, 'sign' is not economically significant unless the magnitude is large enough to matter, and statistical significance does not indicate whether it is large or small (Carver, 1993;Sullivan & Feinn, 2012;Wasserstein & Lazar, 2016;Ziliak, 2016). Low P values do not necessarily imply large or more important effects (Wasserstein & Lazar, 2016). ...
Article
A recent paper in Management Accounting Research (MAR) claimed that the validity of positivistic management accounting research (PMAR) has increased significantly during the last four decades. We argue that this is a misrepresentation of reality, as the current crisis of irreproducible statistical findings is not addressed. The reliability and validity of statistical findings are under increasing pressure due to the phenomenon of Questionable Research Practices (QRPs), which is argued to increase the ratio of false positives through a distortion of the hypothetico-deductive method in favour of a researcher's own hypothesis. This phenomenon is known to be widespread in the social sciences. We therefore conduct a meta-analysis of the susceptibility of PMAR's publication practices to QRPs, and our findings give rise to concern, as there are indications of a publication practice that (unintentionally) incentivises the use of QRPs. It is therefore rational to assume that the ratio of false positives is well above the conventional five per cent. To break the bad equilibrium of QRPs, we suggest three different solutions and discuss their practical viability.
... This is arguably a merit of the legal community's long tradition of approaching with caution single "absolute" sources of certainty of any type (statistical significance testing undoubtedly, and erroneously, claiming to be one) and instead weighing the entire body of evidence for and against a specific thesis in a more balanced and nuanced way. A recent example of such a cautious and thoughtful approach, which has somehow even become a paradigm, can be seen in the 2011 case Matrixx Initiatives, Inc. v. Siracusano [24], a seminal decision by the United States Supreme Court that has been widely commended and appreciated even beyond legal circles [25][26][27][28][29]. The case, involving the pharmaceutical company Matrixx Initiatives, centered on the question of "whether a plaintiff can state a claim for securities fraud based on a pharmaceutical company's failure to disclose reports of adverse events associated with a product" if the reports did not contain statistically significant evidence that the adverse effects may be caused by the use of the product [24]. ...
... dismissing a key role for null hypothesis testing according to Fisher's rule in establishing (and refuting) proof of causation. Unsurprisingly, many scholars have expressed appreciation for this highly relevant opinion, indicating how public law theory can adopt a correct approach to a highly specific and "sophisticated" statistical concept such as statistical significance/null hypothesis testing [25][26][27][28][29]. This comes as no surprise, however, since the issues raised in this seminal ruling by the Supreme Court have long been known to public law scholarship, as comprehensively illustrated in a relevant paper by David Kaye published as early as 1986 in the Washington Law Review [30]. ...
Article
Full-text available
Following a fundamental statement made in 2016 by the American Statistical Association and broad, consistent changes in data analysis and interpretation methodology in public health and other sciences, statistical significance/null hypothesis testing is being increasingly criticized and abandoned in the reporting and interpretation of the results of biomedical research. This shift in favor of a more comprehensive and non-dichotomous approach to the assessment of causal relationships may have a major impact on human health risk assessment. It is interesting to see, however, that authoritative opinions by the Supreme Court of the United States and European regulatory agencies have somehow anticipated this tide of criticism of statistical significance testing, thus providing additional support for its demise. Current methodological evidence further warrants abandoning this approach in both the biomedical and public law contexts, in favor of a more comprehensive and flexible method of assessing the effects of toxicological exposure on human and environmental health.
... Null hypothesis significance testing (NHST) has been, until recently, the standard approach to data analysis and interpretation in most biomedical studies and even beyond this domain, heavily involving other scientific disciplines such as physics, economics, and psychology [1][2][3][4][5][6]. Indeed, there is probably no other aspect of statistics and methodology in general that has so affected how data are collected and interpreted within a study, whether experimental or observational, and whether in humans, in animals, or in vitro. ...
... dismissing a key role for null hypothesis testing according to Fisher's rule in establishing (or refuting) proof of causation. Unsurprisingly, a large number of scholars have expressed their appreciation for this highly relevant opinion, indicating how public law theory can adopt a correct approach to a highly specific and 'sophisticated' statistical concept such as statistical significance/null hypothesis testing 2,26,28,29 . This comes as no surprise, however, since the issues raised in this seminal ruling by the Supreme Court have long been known to public law scholarship, as comprehensively illustrated in a relevant article by Kaye published as early as 1986 in the Washington Law Review 32 . ...
Article
Full-text available
Null hypothesis significance testing (NHST) was once widely popular and almost systematically used for the identification of causal relations and for risk assessment in toxicology and medicine. Interestingly, the public law world has been more prudent and more advanced than the biomedical one in the use of this dichotomous approach, based on the conventional p-value cut-points of 0.05/0.001, to assess causality. The recent 2016 statement by the American Statistical Association, the joint action by methodologists in all fields of science, and not least the seminal decisions by the US Supreme Court have highlighted the pitfalls of the dichotomous approach embedded in NHST. Overall, they have also indicated the need to dismiss NHST entirely when assessing causal relations, favoring instead a more flexible and adequate approach to data analysis and interpretation. The demise of statistical significance testing would have major beneficial implications for risk assessment in toxicology, public health, and human medicine, alongside important public law implications. It could also lead to a reanalysis and re-interpretation of previous studies and bodies of evidence that may have been inaccurately assessed due to the flaws inherent in NHST.
... von Wehrden, Schultner, and Abson 2015). As pointed out by Ziliak (2016), journals and funders should move toward incentivizing substantive significance as opposed to statistical significance. Visualization and other methods to display uncertainty (e.g. ...
... Power, Bayes factors, but also confidence intervals or simulations) could improve current practice. Large funders and research organizations could also list the abuse of p-values as scientific misconduct (Ziliak 2016). ...
Article
McCloskey and Ziliak analyzed two decades of econometric practice in the American Economic Review (AER). We review the arguments and develop a questionnaire, applying it to three Agricultural Economics journals and the AER. Statistical practice improved over time, but a greater focus on economic significance is still needed. Considering the power of tests and discussing the economic consequences of type I and type II error were rarely practiced. The AER and the American Journal of Agricultural Economics did not substantially differ in their performance. We discuss examples of statistical practice and conclude with implications for the publication process and teaching.
... P-value hacking, also known as p-hacking, data dredging or data fishing, is a QRP in which researchers repeatedly perform statistical tests on their data until they obtain a result that is considered significant (Ziliak, 2016). This can be done by manipulating the data, selecting only certain variables/sample members, or using different statistical methods until a desired outcome is achieved. ...
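The mechanism described in the excerpt above is easy to demonstrate with a small simulation. The sketch below is illustrative only (the function name and parameters are mine, not taken from any cited study): a researcher "peeks" at accumulating data from two identical populations and stops at the first test with p below the threshold, which inflates the false-positive rate well above the nominal level.

```python
import math
import numpy as np

def peeking_false_positive_rate(n_experiments=2000, n_per_peek=30,
                                n_peeks=5, alpha=0.05, seed=0):
    """Simulate a researcher who re-tests after every batch of
    observations and stops at the first 'significant' result.
    Both groups are drawn from the same normal distribution, so the
    true effect is zero and every rejection is a false positive."""
    rng = np.random.default_rng(seed)
    false_positives = 0
    for _ in range(n_experiments):
        a = rng.normal(size=n_per_peek * n_peeks)
        b = rng.normal(size=n_per_peek * n_peeks)
        for k in range(1, n_peeks + 1):
            n = n_per_peek * k            # sample size at this peek
            diff = a[:n].mean() - b[:n].mean()
            se = math.sqrt(a[:n].var(ddof=1) / n + b[:n].var(ddof=1) / n)
            z = diff / se
            p = math.erfc(abs(z) / math.sqrt(2))  # two-sided z-test p-value
            if p < alpha:
                false_positives += 1
                break
    return false_positives / n_experiments
```

With five interim looks, the long-run false-positive rate typically lands well above the nominal 5%, which is the core of the p-hacking critique: each extra look is an extra chance to reject a true null.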
Preprint
Full-text available
Consumer psychology is facing various challenges, including a lack of research integrity and unethical publishing practices. This commentary lists pivotal events and discusses related findings that point to the field's need for reform. Open Science principles are proposed as a transformative solution to promote transparency in data, methodology, access, and peer review. Consumer psychology can only be revitalized and regain credibility if it fully embraces these four pillars. Academic and professional associations with an impact on consumer psychology must set a good example by cultivating a culture of integrity and accountability in research and publishing. Consumer psychologists must educate future generations of researchers on research methodology and research ethics.
... However, it must be mentioned that the measure of statistical significance is only relevant in survey data analysis and is meaningless for the present analysis, which works with the total population of all Czech municipalities with extended powers (Soukup and Rabušic 2007). Furthermore, statistical significance itself does not imply that results have practical consequences, as it is neither necessary nor sufficient for proving the existence of an expected association (Wasserstein, Schirm, and Lazar 2019; Ziliak 2016). ...
Article
The main objective of this paper is to analyse where women run for and win seats in local councils of Czech towns between 1998 and 2018. Our results are to some extent contradictory to those from Western Europe. More importantly, this study demonstrates that strategic context impacts on women’s emergence and success in local elections in a different way. First, we do not confirm that larger towns are more promising for women’s representation. While more fragmented party systems in larger cities contribute to making women’s candidacy more common, a large pool of female candidates does not result in their higher presence in local councils. Second, we identify openness of local environment to women, in terms of women’s previous representation, as a strong determinant of female representation in Czech towns, both in terms of candidacy and success.
... However, others have argued convincingly that statistical significance should sometimes be discounted or ignored (e.g. [68][69][70]). ...
Article
Full-text available
This paper focusses on the application of spectral analysis and continuous wavelet transforms to water drainages from full-scale minesite components, such as open pits, waste-rock piles, and tailings impoundments. Three minesite-drainage databases included high-frequency monitoring (as frequently as every 15 min) and/or long-term monitoring (up to 31 years) of both flows and aqueous chemistries. These databases were cleaned only by deleting very obvious outliers and ignoring statistical significance, so that extreme events and fractal patterns could be detected. In all three full-scale minesite-drainage databases, 1-over-f fractal slopes were common in the spectral analyses, but other slopes mostly between zero and 2.0 were also found. Spectral analyses also produced anomalous spectral slopes. Simple simulations showed these could be explained by major unseen seasonal changes in water retention by upstream buried ponds or subsurface aquitards. Wavelet transforms for the three minesite-drainage databases provided important observations such as (1) the varying strengths of periodicity with time, (2) the differing periodicities between physical drainage flows and their aqueous chemistries, and (3) the effect of placing a fine-grained soil/till cover over a waste-rock pile. Based on all three minesite-drainage databases, the most common wavelengths for strong, persistent periodicities were 1 year and 1 week. Other wavelengths of strong periodicity for at least two minesites were 10 years, approximately 4 months, and half-monthly to monthly. The minesite with data as frequent as every 15 min also showed strong periodicities over 1 day and less.
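The "1-over-f fractal slopes" reported in the abstract above are typically estimated by fitting a line to the power spectrum on log-log axes. The sketch below is a generic illustration of that slope-fitting idea, not the authors' actual procedure; the function names and the synthetic test signal are my own assumptions.

```python
import numpy as np

def spectral_slope(x, dt=1.0):
    """Estimate beta in P(f) ~ 1/f**beta by regressing log(power)
    on log(frequency) over the series' periodogram."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    freqs = np.fft.rfftfreq(len(x), d=dt)[1:]      # drop the zero frequency
    power = np.abs(np.fft.rfft(x))[1:] ** 2
    slope, _ = np.polyfit(np.log(freqs), np.log(power), 1)
    return -slope

def synth_power_law(beta, n=4096, seed=1):
    """Build a test series whose spectrum follows f**(-beta) exactly,
    by assigning magnitudes f**(-beta/2) with random phases."""
    rng = np.random.default_rng(seed)
    freqs = np.fft.rfftfreq(n)
    mags = np.zeros_like(freqs)
    mags[1:] = freqs[1:] ** (-beta / 2.0)          # leave the DC bin at zero
    phases = np.exp(1j * rng.uniform(0, 2 * np.pi, size=freqs.size))
    return np.fft.irfft(mags * phases, n=n)
```

For real drainage series one would detrend and window before transforming; this sketch only shows how a fitted slope near 1 corresponds to the "1-over-f" behaviour the paper describes.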
... The ASA statement highlights that the p-value does not provide a good measure of evidence regarding a hypothesis. In other words, it does not provide a ...
1 See, for example, McCloskey and Ziliak (1996), Sellke et al. (2001), Ioannidis (2005), Ziliak and McCloskey (2008), Krämer (2011), Ioannidis and Doucouliagos (2013), Kline (2013), Colquhoun (2014), Gelman and Loken (2014), Motulsky (2014), Vogt et al. (2014), Gigerenzer and Marewski (2015), Greenland et al. (2016), Hirschauer et al. (2016; 2018), Wasserstein and Lazar (2016), Ziliak (2016), Amrhein et al. (2017), and Trafimow et al. (2018). This list contains but a small selection of the literature on p-value misconceptions from the last 20 years. ...
Article
The paper is motivated by prevalent inferential errors and the intensifying debate on p-values, as expressed, for example, in the activities of the American Statistical Association, including its p-value symposium in 2017 and the March 2019 special issue 'Statistical Inference in the 21st Century: A World Beyond p < 0.05'. A related petition in Nature arguing that it is time to retire statistical significance was supported by more than 800 scientists. While we provide more details and practical advice, our 20 suggestions are essentially in line with this petition. Even if one is aware of the fundamental pitfalls of NHST, it is difficult to escape the categorical reasoning that is so entrancingly suggested by its dichotomous significance declarations. With a view to the p-value's deep entrenchment in current research practices and the apparent need for a basic consensus on how to do things in the future, we suggest twenty immediately actionable steps to reduce widespread inferential errors. Our propositions aim at fostering the logical consistency of inferential arguments, which is the prerequisite for understanding what we can and what we cannot conclude from both original studies and replications. They are meant to serve as a basis for discussion, or even a tool kit, for editors of economics journals who aim to revise their guidelines to increase the quality of published research.
... Dichotomization, in conjunction with misleading terminology, propagates cognitive biases that seduce researchers into making logically inconsistent and overconfident inferences, both when p is below and when it is above the "significance" threshold. The following errors seem to be particularly widespread: 1
1) use of p-values when there is neither random sampling nor randomization;
2) confusion of statistical and practical significance, or complete neglect of effect size;
3) unwarranted binary statements of there being an effect as opposed to no effect, coming along with
- misinterpretations of p-values below 0.05 as posterior probabilities of the null hypothesis,
- mixing up of estimating and testing, and misinterpretation of "significant" results as evidence confirming the coefficients/effect sizes estimated from a single sample,
- treatment of "statistically non-significant" effects as being zero (confirmation of the null);
4) inflation of evidence caused by unconsidered multiple comparisons and p-hacking;
5) inflation of effect sizes caused by considering "significant" results only.
1 See, for example, McCloskey and Ziliak (1996), Sellke et al. (2001), Ioannidis (2005), Ziliak and McCloskey (2008), Krämer (2011), Ioannidis and Doucouliagos (2013), Kline (2013), Colquhoun (2014), Gelman and Loken (2014), Motulsky (2014), Vogt et al. (2014), Gigerenzer and Marewski (2015), Greenland et al. (2016), Hirschauer et al. (2016; 2018), Wasserstein and Lazar (2016), Ziliak (2016), Amrhein et al. (2017), and Trafimow et al. (2018). This list contains but a small selection of the literature on p-value misconceptions from the last 20 years. ...
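The confusion of statistical with practical significance mentioned in the excerpt above follows directly from how p-values scale with sample size. A minimal illustration using a plain two-sided z-test (the helper name and the chosen numbers are mine, purely for demonstration):

```python
import math

def two_sample_p(mean_diff, sd, n):
    """Two-sided p-value for a difference in means between two groups
    of size n sharing standard deviation sd (z approximation)."""
    se = sd * math.sqrt(2.0 / n)
    z = mean_diff / se
    return math.erfc(abs(z) / math.sqrt(2))

# A negligible 0.01-SD effect becomes "highly significant" with enough data,
p_tiny = two_sample_p(mean_diff=0.01, sd=1.0, n=1_000_000)
# while a substantial 0.5-SD effect is "non-significant" in a small sample.
p_large = two_sample_p(mean_diff=0.5, sd=1.0, n=20)
```

The 0.01-SD difference is practically irrelevant yet clears the conventional threshold easily, while the 0.5-SD difference, which would matter in most applications, fails it; the p-value answers a different question than "is this effect large enough to matter?".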
Preprint
Full-text available
We suggest twenty immediately actionable steps to reduce widespread inferential errors related to “statistical significance testing.” Our propositions refer first to the theoretical preconditions for using p-values. They furthermore include wording guidelines as well as structural and operative advice on how to present results, especially in research based on multiple regression analysis, the workhorse of empirical economics. Our propositions aim at fostering the logical consistency of inferential arguments by avoiding false categorical reasoning. They are not aimed at dispensing with p-values or completely replacing frequentist approaches with Bayesian statistics.
... To examine the effects of the aikido programme in greater depth, and following the guidance of the sixth edition of the APA publication manual (2010) on the importance of reporting confidence intervals and effect-size measures (a point reinforced by critiques of the traditional null-hypothesis-testing model based on the p < .05 cut-off, according to several authors such as Amrhein, Korner-Nievergelt, & Roth, 2017; Concato & Hartigan, 2016; Cumming, 2014; Greenland et al., 2016; Jiroutek & Turner, 2016; Perezgonzalez, 2015; Ziliak, 2016; as well as the recent statement of the American Statistical Association published in Wasserstein & Lazar, 2016), the pretest-posttest percent change (PC) was calculated for the variables mindfulness, the psychic (CP) and somatic (CS) components of anxiety, and the global anxiety score, using the following formula: [formula not reproduced in the excerpt]. Along the same lines, the repeated-measures effect size (TE) was calculated for each group using the formula of (Becker, 1988; Thomas, Nelson, and Silverman, 2015): [formula not reproduced in the excerpt] ...
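The two formulas referenced in the excerpt above are not reproduced in it. Under the definitions these terms usually carry (an assumption on my part, not taken from the excerpt), percent change and the Becker-style repeated-measures effect size can be computed as:

```python
def percent_change(pre_mean, post_mean):
    """Pretest-posttest percent change: 100 * (post - pre) / pre
    (standard definition; assumed, since the excerpt omits the formula)."""
    return 100.0 * (post_mean - pre_mean) / pre_mean

def repeated_measures_d(pre_mean, post_mean, pre_sd):
    """Standardized mean change scaled by the pretest standard deviation,
    the convention commonly attributed to Becker (1988); also assumed,
    not quoted from the study."""
    return (post_mean - pre_mean) / pre_sd

print(percent_change(50.0, 60.0))            # 20.0
print(repeated_measures_d(50.0, 60.0, 5.0))  # 2.0
```

Reporting both quantities, as the study does, is exactly the practice the surrounding literature recommends: a magnitude on the original scale plus a standardized effect size, rather than a bare p-value.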
Article
Full-text available
The purpose of the study was to examine the effect of practicing aikido on mindfulness and anxiety state in university students with no previous experience in martial arts. We used an intra-subjects quasi-experimental design with pre- and post-treatment measurements, with an active control group (physical education students). Mindfulness was measured with the MAAS scale, and anxiety with the Hamilton scale. A training programme focused on learning and practicing various aikido techniques (waza), and the way these techniques should be received (ukemi), was implemented over 11 weeks (2 weekly sessions of 2 hours each). Experimental group: n = 12, aged between 18 and 62 years. Control group: n = 12 students, aged between 21 and 34 years. Results: the practice of aikido showed significant effect sizes of moderate magnitude in both mindfulness and anxiety. Age does not explain these findings. Follow-up studies are recommended.
... A. Article Retraction: There are many studies on retraction from different points of view: the reasons behind research misconduct, such as publication pressure (Tijdink, Verbeke & Smulders, 2014), or different forms of misbehaviour in research (Fang, Steen & Casadevall, 2012; Noyori & Richmond, 2013; Gross, 2016; Ziliak, 2016; Sugawara et al., 2017); the growth of retraction through formal channels (Tijdink, Verbeke & Smulders, 2014; Couzin-Frankel, 2013; Fanelli, 2013); the phenomenon of an article's citations increasing after retraction (Pfeifer & Snodgrass, 1990; Budd, Sievert & Schultz, 1998); and citations decreasing after retraction (Furman, Jensen & Murray, 2012). ...
Conference Paper
Full-text available
Retracted articles are papers rejected by their publisher after the publication date because of some kind of scientific misconduct. This research examines retraction in 354 Biochemistry & Molecular Biology papers indexed in the Web of Science, to see whether their traditional impact has any relation to their modern impact. Using both scientometric and altmetric approaches, citations and mentions were studied for a research sample of 185 articles. The results show a growth in the retraction rate for this field; in addition, 67 citations and 263 mentions were counted. There was no significant correlation between the traditional and modern impact of these articles; however, there was a correlation between traditional impact and publication date, as well as between modern impact and publication date.
... Furthermore, statistical significance itself does not imply that results have practical consequences. As stressed by Ziliak (2016), there are increasing concerns that researchers should pay more attention to statistically "insignificant" findings that are nevertheless materially significant to the problems solved: "statistical significance is by itself neither necessary nor sufficient for proving a scientific … claim. Rational assessment of the probability or likelihood of a hypothesis cannot be derived from statistical methods alone" (Ziliak, 2016: 85). ...
Article
The main objective of this article is to analyse the determinants of women’s descriptive representation in the 2014 local elections in the Czech Republic and Slovakia. It is shown that although both countries are considered democratic, and in spite of two decades of multi-dimensional transition, women are underrepresented at the local level. We study especially the electoral results in municipalities considered sub-regional centres, where almost one half of the population of both countries is concentrated. As we point out, factors like local population size, as well as political and institutional factors, play an important role in women’s success in local politics.
Preprint
Full-text available
We suggest twenty immediately actionable steps to reduce widespread inferential errors related to “statistical significance testing.” Our propositions refer first to the theoretical preconditions for using p-values. They furthermore include wording guidelines as well as structural and operative advice on how to present results, especially in multiple regression analysis. Our propositions aim at fostering the logical consistency of inferential arguments by avoiding false categorical reasoning. They are not aimed at dispensing with p-values or completely replacing frequentist approaches with Bayesian statistics.
Article
Full-text available
The first statistics course plays an important role in many graduate programmes, but instructors face certain challenges. On one hand, the standard curriculum of many undergraduate statistics courses places too much emphasis on traditional hypothesis testing, filling students' minds with myths that run counter to the development of statistical literacy. On the other, the quality of undergraduate courses varies greatly, so incoming graduate students may begin a graduate course with widely varying levels of statistical knowledge. This article aims to provide constructive advice to instructors of first graduate statistics courses, including suggestions on how to assess statistical knowledge and take early corrective action to address gaps in their students' knowledge. Essential topics that ought to be covered in a reformist, post-p-value statistics course are also recommended. The goal is to help instructors promote among students a more modern understanding of statistics, one truly worthy of graduate study.
Article
Full-text available
The paper analyses women’s representation at the local level in Slovakia, that is, across all (almost three thousand) Slovak municipalities. We focus on the determinants of women’s descriptive representation in mayoral offices and on how various factors (socioeconomic, cultural, or political) affect women’s political representation at this level. The main findings are that education and cultural factors (Catholicism and the share of the population of Hungarian nationality) had only a very limited effect on women’s representation, in contrast to the much stronger negative effect of municipality size, which significantly decreased the number of women in the position of Slovak mayor. However, we show that the strongest effect on women’s chances of being elected mayor is whether a woman held the mayoral post in a given municipality in the previous electoral term. This factor strongly favours women in subsequent mayoral elections, and at the same time it explains almost all of the variance in the dependent variable.
Article
Full-text available
Although there is extensive comparative research focusing on the influence of various factors contributing to the increase of female representation at the national level, relatively little space is devoted to similar research at (sub)state levels of governance. Hence, the main objective of this article is to analyse the determinants of women’s descriptive representation in Czech and Slovak regional elections. We show that women’s representation at the regional level is lower not only in comparison with the national level but especially with the local level. Our results confirm that women are significantly advantaged in regions where women held a much higher representation in the previous electoral term. However, other factors show only a little of the expected positive influence on women’s representation in the Czech Republic (district magnitude), while we find a negative influence of economic development (unexpected) and Catholicism (expected). In Slovakia, by contrast, most factors influence women’s representation in the expected way. We find higher women’s representation in regions characterized by higher economic development, higher district magnitudes, a higher difference in salaries between men and women, and a lower share of Catholics and Hungarians. Furthermore, the electoral system proves to be a strong factor, as a proportional system, together with higher magnitudes, strongly increases women’s representation. Generally, while the results from the Czech Republic indicate that women’s representation is influenced rather by institutional variables, together with greater openness to women based on previous experience, in Slovakia the relation among the various factors is much more complex, influenced by all types of variables.
Article
Full-text available
The main objective of this article is to analyse the political determinants of women’s descriptive representation at the local level in communal elections (i.e. the position of mayor) in the Czech Republic and Slovakia over the past decade. It focuses on the political opportunity structure (i.e. the structure of relationships that affect social and political behaviour) and asks whether this structure also affects women’s political representation. It shows that women are significantly advantaged in municipalities where a woman held the mayoral post in a previous electoral term. In contrast to other studies, previous women’s representation in a municipal council is found here to have only a limited effect. The strong negative effect of the direct election of mayors and the negative effect of municipal size (only in Slovakia) indicate that women’s representation as mayors may be the result of interdependent phenomena combining institutional structure (e.g. the electoral procedure, the mayor’s powers) and contextual political factors (past experience with a female mayor, not necessarily an incumbent). This finding challenges earlier studies and shows that identifying a clear list of determinants of women’s representation as mayors is a complex task, making it difficult to pursue a broader comparative study in a different institutional environment or a different political culture.
Article
Full-text available
Misinterpretation and abuse of statistical tests, confidence intervals, and statistical power have been decried for decades, yet remain rampant. A key problem is that there are no interpretations of these concepts that are at once simple, intuitive, correct, and foolproof. Instead, correct use and interpretation of these statistics requires an attention to detail which seems to tax the patience of working scientists. This high cognitive demand has led to an epidemic of shortcut definitions and interpretations that are simply wrong, sometimes disastrously so, and yet these misinterpretations dominate much of the scientific literature. In light of this problem, we provide definitions and a discussion of basic statistics that are more general and critical than typically found in traditional introductory expositions. Our goal is to provide a resource for instructors, researchers, and consumers of statistics whose knowledge of statistical theory and technique may be limited but who wish to avoid and spot misinterpretations. We emphasize how violation of often unstated analysis protocols (such as selecting analyses for presentation based on the P values they produce) can lead to small P values even if the declared test hypothesis is correct, and can lead to large P values even if that hypothesis is incorrect. We then provide an explanatory list of 25 misinterpretations of P values, confidence intervals, and power. We conclude with guidelines for improving statistical interpretation and reporting.
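The selection effect the abstract highlights (keeping only the analysis that "works") can be made concrete with a short simulation. This is an illustrative sketch of the general phenomenon, not code from the paper; the function name and trial counts are my own choices.

```python
import random

random.seed(1)

def false_positive_rate(k_analyses, trials=20000):
    """Share of simulated 'studies' that report p < .05 when only the
    best of k true-null analyses is kept.

    Under a true null hypothesis, p-values are Uniform[0, 1], so each
    independent analysis has a 5% chance of clearing the bar on its own.
    """
    hits = sum(
        1 for _ in range(trials)
        if min(random.random() for _ in range(k_analyses)) < 0.05
    )
    return hits / trials

print(false_positive_rate(1))   # roughly 0.05, the nominal rate
print(false_positive_rate(10))  # roughly 0.40, i.e. 1 - 0.95**10
```

Reporting only the smallest of ten null p-values turns a nominal 5% error rate into one of about 40%, which is the mechanism behind "small P values even if the declared test hypothesis is correct."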
Article
Full-text available
Significance: The lack of reproducibility of scientific research undermines public confidence in science and leads to the misuse of resources when researchers attempt to replicate and extend fallacious research findings. Using recent developments in Bayesian hypothesis testing, a root cause of nonreproducibility is traced to the conduct of significance tests at inappropriately high levels of significance. Modifications of common standards of evidence are proposed to reduce the rate of nonreproducibility of scientific research by a factor of 5 or greater.
Article
Full-text available
Because scientists tend to report only studies (publication bias) or analyses (p-hacking) that “work,” readers must ask, “Are these effects true, or do they merely reflect selective reporting?” We introduce p-curve as a way to answer this question. P-curve is the distribution of statistically significant p values for a set of studies (ps < .05). Because only true effects are expected to generate right-skewed p-curves—containing more low (.01s) than high (.04s) significant p values—only right-skewed p-curves are diagnostic of evidential value. By telling us whether we can rule out selective reporting as the sole explanation for a set of findings, p-curve offers a solution to the age-old inferential problems caused by file-drawers of failed studies and analyses.
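The p-curve logic can be illustrated with a minimal stdlib-only simulation; this is my own sketch of the intuition, not the authors' code, and the test setup (one-sample z-tests with known unit variance, effect size 0.5, n = 20) is an assumption chosen for simplicity.

```python
import math
import random

random.seed(0)

def norm_cdf(z):
    """Standard normal CDF via the error function (stdlib only)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def sig_p_values(effect, n=20, trials=20000):
    """Simulate one-sample z-tests (sigma = 1 known) and keep only the
    p-values that clear the conventional p < .05 bar."""
    kept = []
    for _ in range(trials):
        mean = sum(random.gauss(effect, 1.0) for _ in range(n)) / n
        p = 2.0 * (1.0 - norm_cdf(abs(mean) * math.sqrt(n)))
        if p < 0.05:
            kept.append(p)
    return kept

for effect in (0.0, 0.5):
    p = sig_p_values(effect)
    low = sum(1 for v in p if v < 0.01) / len(p)   # the ".01s"
    high = sum(1 for v in p if v > 0.04) / len(p)  # the ".04s"
    print(f"effect={effect}: share p<.01 = {low:.2f}, share p>.04 = {high:.2f}")
```

With no true effect the significant p-values are uniform on (0, .05), so the shares of .01s and .04s are about equal (a flat p-curve); with a true effect the .01s heavily outnumber the .04s, producing the right skew that the authors treat as diagnostic of evidential value.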
Article
Full-text available
Comments on the contention of T. A. Ryan (see record 1985-21808-001) that the purpose of statistics is to establish new facts that will contribute to the development of theory. It is argued that the primary role of statistical analysis is summarizing the current state of knowledge about scientific questions under study. The present authors do not share the following views expressed by Ryan: (1) Nonsignificant results should not be published; (2) the importance of Type II errors has been overemphasized relative to Type I; and (3) avoiding consideration of importance and interest in the weighting of contrasts is possible. The reasons for the disagreements are discussed. (7 ref) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Full-text available
There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.
Chapter
This entry provides an overview of the United States Supreme Court. Specifically, it discusses the creation and jurisdiction of the Court, the composition and procedures of the Court, and a history of case law that the Court has established.
Article
The Supreme Court of the United States of America is one of the most important institutions of the country. Indeed, it has always been at the heart of issues giving rise to heated and difficult debates within American society and its political institutions. This article is an attempt to highlight the activities of this court, its history, its particularity and efficiency, and the role the Court plays not only as a judicial institution but also as a decision-making one. The paper also points to some salient issues characteristic of the structure and activity of the Supreme Court, such as the nomination of justices, the Congressional battles over the confirmation of a Justice to the Supreme Court, and the background on which those nominees are selected, especially academic background, professional experience, ideology, religion and race.
Article
In several dozen journal reviews and in many other comments we have received, from, for example, four Nobel laureates, the statistician Dennis Lindley (2012), the statistician Arnold Zellner (2004), the mathematician Olle Häggström (2010), the sociologist Steve Fuller (2008), and the historian Theodore Porter (2008), no one has ever tried to defend null hypothesis significance testing and its numerous errors. Recent articles by Thomas Mayer (2012, 2013), commenting on our book The Cult of Statistical Significance, are no exception. Of the five major claims we make in our book about the theory and practice of significance testing in economics, Mayer strongly agrees with four. On the fifth claim our disagreement is a matter of degree, not of kind, with no substantive change in results. Overall, Mayer agrees with us and with the new and growing consensus that statistical significance proves essentially nothing and has to change.
Article
P values, the 'gold standard' of statistical validity, are not as reliable as many scientists assume.
Article
Many traditional multivariate techniques such as ordination, clustering, classification and discriminant analysis are now routinely used in most fields of application. However, the past decade has seen considerable new developments, particularly in computational ...
Article
Does the manner in which results are presented in empirical studies affect perceptions of the predictability of the outcomes? Noting the predominant role of linear regression analysis in empirical economics, we asked 257 academic economists to make probabilistic inferences given different presentations of the outputs of this statistical tool. Questions concerned the distribution of the dependent variable conditional on known values of the independent variable. Answers based on the presentation mode that is standard in the literature led to an illusion of predictability; outcomes were perceived to be more predictable than could be justified by the model. In particular, many respondents failed to take the error term into account. Adding graphs did not improve inferences. Paradoxically, when only graphs were provided (i.e., no regression statistics), respondents were more accurate. The implications of our study suggest, inter alia, the need to reconsider how to present empirical results and the possible provision of easy-to-use simulation tools that would enable readers of empirical papers to make accurate inferences.
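The error term's contribution can be made concrete. A toy calculation (my own, not from the paper) for a standardized simple regression shows how little predictive certainty even a "highly significant" slope buys when R² is modest:

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function (stdlib only)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_y_positive(r_squared, x):
    """P(y > 0 | x) in a standardized linear model y = b*x + e, where
    b = sqrt(r_squared) and Var(e) = 1 - r_squared, so the model's
    R-squared equals r_squared."""
    r = math.sqrt(r_squared)
    return norm_cdf(r * x / math.sqrt(1.0 - r_squared))

# With R^2 = 0.25, knowing x is one standard deviation above its mean
# still leaves the sign of y quite uncertain:
print(round(prob_y_positive(0.25, 1.0), 2))  # ~0.72, far from certainty
```

Ignoring the error term amounts to treating this probability as near 1, which is exactly the illusion of predictability the study documents.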
Article
Like many Bayesians, I have often represented classical confidence intervals as posterior probability intervals and interpreted one-sided P values as the posterior probability of a positive effect. These are valid conditional on the assumed noninformative prior but typically do not make sense as unconditional probability statements. As Sander Greenland has discussed in much of his work over the years, epidemiologists and applied scientists in general have knowledge of the sizes of plausible effects and biases. I believe that a direct interpretation of P values as posterior probabilities can be a useful start, if we recognize that such summaries systematically overestimate the strength of claims from any particular dataset. In this way, I am in agreement with Greenland and Poole's interpretation of the one-sided P value as a lower bound of a posterior probability, although I am less convinced of the practical utility of this bound, given that the closeness of the bound depends on a combination of sample size and prior distribution. The default conclusion from a noninformative prior analysis will almost invariably put too much probability on extreme values. A vague prior distribution assigns much of its probability to values that are never going to be plausible, and this disturbs the posterior probabilities more than we tend to expect, something that we probably do not think about enough in our routine applications of standard statistical methods. Greenland and Poole perform a valuable service by opening up these calculations and placing them in an applied context.
Article
Differentiates between mathematical and scientific methods. The differences between scientific intuition and mathematical results have been attributed to the fact that scientific generalization is broader than mathematical description. While scientific methods deal with samples which are representative of the total whole, the mathematical methods measure the differences between the particular samples observed. Science begins with description but ends in generalization. Mathematical measures are too high and may need to be discounted in arriving at a scientific conclusion. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
It is conventionally thought that a small p-value confers high credibility on the observed alternative hypothesis, and that a repetition of the same experiment will have a high probability of resulting again in statistical significance. It is shown that if the observed difference is the true one, the probability of repeating a statistically significant result, the ‘replication probability’, is substantially lower than expected. The reason for this is a mistake that generates other seeming paradoxes: the interpretation of the post-trial p-value in the same way as the pre-trial α error. The replication probability can be used as a frequentist counterpart of Bayesian and likelihood methods to show that p-values overstate the evidence against the null hypothesis.
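The replication probability in this setup has a closed form for two-sided z-tests, and a short stdlib-only sketch (my own illustration of the paper's argument, not its code) makes the surprise quantitative:

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function (stdlib only)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_ppf(q):
    """Inverse standard normal CDF by bisection (stdlib only)."""
    lo, hi = -10.0, 10.0
    for _ in range(80):
        mid = (lo + hi) / 2.0
        if norm_cdf(mid) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def replication_probability(p_obs, alpha=0.05):
    """P(an identical replication reaches p < alpha), taking the observed
    effect as the true one (two-sided z-tests, same sample size)."""
    z_obs = norm_ppf(1.0 - p_obs / 2.0)
    z_alpha = norm_ppf(1.0 - alpha / 2.0)
    return 1.0 - norm_cdf(z_alpha - z_obs)

print(round(replication_probability(0.05), 2))  # 0.5: a coin flip
print(round(replication_probability(0.01), 2))  # ~0.73
```

A result observed at exactly p = .05 has only a 50% chance of replicating at the .05 level even if the observed effect is the true one, which is the gap between the post-trial p-value and the pre-trial alpha that the paper describes.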
Article
Statistical analysis is universally used in the interpretation of the results of basic biomedical research, being expected by referees and readers alike. Its role in helping researchers to make reliable inference from their work and its contribution to the scientific process cannot be doubted, but can be improved. There is a widespread and pervasive misunderstanding of P-values that limits their utility as a guide to inference, and a change in the manner in which P-values are specified and interpreted will lead to improved outcomes. This paper explains the distinction between Fisher's P-values, which are local indices of evidence against the null hypothesis in the results of a particular experiment, and Neyman–Pearson α levels, which are global rates of false positive errors from unrelated experiments taken as an aggregate. The vast majority of papers published in pharmacological journals specify P-values, either as exact values or as being less than a value (usually 0.05), but they are interpreted in a hybrid manner that detracts from their Fisherian role as indices of evidence without gaining the control of false positive and false negative error rates offered by a strict Neyman–Pearson approach. An informed choice between those approaches offers substantial advantages to the users of statistical tests over the current accidental hybrid approach. LINKED ARTICLES: A collection of articles on statistics as applied to pharmacology can be found at http://onlinelibrary.wiley.com/journal/10.1111/(ISSN)1476-5381/homepage/statistical_reporting.htm
Article
“McCloskey and Ziliak have been pushing this very elementary, very correct, very important argument through several articles over several years and for reasons I cannot fathom it is still resisted. If it takes a book to get it across, I hope this book will do it. It ought to.” —Thomas Schelling, Distinguished University Professor, School of Public Policy, University of Maryland, and 2005 Nobel Prize Laureate in Economics “With humor, insight, piercing logic and a nod to history, Ziliak and McCloskey show how economists—and other scientists—suffer from a mass delusion about statistical analysis. The quest for statistical significance that pervades science today is a deeply flawed substitute for thoughtful analysis. . . . Yet few participants in the scientific bureaucracy have been willing to admit what Ziliak and McCloskey make clear: the emperor has no clothes.” —Kenneth Rothman, Professor of Epidemiology, Boston University School of Public Health. The Cult of Statistical Significance shows, field by field, how “statistical significance,” a technique that dominates many sciences, has been a huge mistake. The authors find that researchers in a broad spectrum of fields, from agronomy to zoology, employ “testing” that doesn’t test and “estimating” that doesn’t estimate. The facts will startle the outside reader: how could a group of brilliant scientists wander so far from scientific magnitudes? This study will encourage scientists who want to know how to get the statistical sciences back on track and fulfill their quantitative promise. The book shows for the first time how wide the disaster is, and how bad for science, and it traces the problem to its historical, sociological, and philosophical roots. Stephen T. Ziliak is the author or editor of many articles and two books. He currently lives in Chicago, where he is Professor of Economics at Roosevelt University. Deirdre N. McCloskey, Distinguished Professor of Economics, History, English, and Communication at the University of Illinois at Chicago, is the author of twenty books and three hundred scholarly articles. She has held Guggenheim and National Humanities Fellowships. She is best known for How to Be Human* Though an Economist (University of Michigan Press, 2000) and her most recent book, The Bourgeois Virtues: Ethics for an Age of Commerce (2006).
Article
In economics and other sciences, "statistical significance" is by custom, habit, and education a necessary and sufficient condition for proving an empirical result. The canonical routine is to calculate what’s called a t-statistic and then to compare its estimated value against a theoretically expected value of it, which is found in "Student's" t table. A result yielding a t-value greater than or equal to about 2.0 is said to be "statistically significant at the 95 percent level." Alternatively, a regression coefficient is said to be "statistically significantly different from the null, p < .05." Canonically speaking, if a coefficient clears the 95 percent hurdle, it warrants additional scientific attention. If not, not. The first presentation of "Student's" test of significance came a century ago in 1908, in "The Probable Error of a Mean," published by an anonymous "Student." The author's commercial employer required that his identity be shielded from competitors, but we have known for some decades that the article was written by William Sealy Gosset (1876–1937), whose entire career was spent at Guinness's brewery in Dublin, where Gosset was a master brewer and experimental scientist. Perhaps surprisingly, the ingenious "Student" did not give a hoot for a single finding of "statistical" significance, even at the 95 percent level of significance as established by his own tables. Beginning in 1904, "Student," who was a businessman besides a scientist, took an economic approach to the logic of uncertainty, arguing finally that statistical significance is "nearly valueless" in itself.
Article
Seventy-one "negative" randomized control trials were re-examined to determine if the investigators had studied large enough samples to give a high probability (greater than 0.90) of detecting a 25 per cent and 50 per cent therapeutic improvement in the response. Sixty-seven of the trials had a greater than 10 per cent risk of missing a true 25 per cent therapeutic improvement, and with the same risk, 50 of the trials could have missed a 50 per cent improvement. Estimates of 90 per cent confidence intervals for the true improvement in each trial showed that in 57 of these "negative" trials, a potential 25 per cent improvement was possible, and 34 of the trials showed a potential 50 per cent improvement. Many of the therapies labeled as "no different from control" in trials using inadequate samples have not received a fair test. Concern for the probability of missing an important therapeutic improvement because of small sample sizes deserves more attention in the planning of clinical trials.
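The underpowering the trial survey documents is easy to reproduce with a standard normal-approximation power calculation. This is a generic sketch, not the authors' method; the control response rate (30%), the 25% relative improvement, and the arm sizes below are illustrative assumptions.

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function (stdlib only)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power_two_proportions(p0, p1, n_per_arm, z_alpha=1.96):
    """Approximate power of a two-sided two-proportion z-test at the
    .05 level (normal approximation, equal arms)."""
    se = math.sqrt(p0 * (1 - p0) / n_per_arm + p1 * (1 - p1) / n_per_arm)
    z = abs(p1 - p0) / se
    return norm_cdf(z - z_alpha)

# A 25% relative improvement over a 30% control response rate:
print(round(power_two_proportions(0.30, 0.375, 50), 2))  # ~0.12: underpowered
print(power_two_proportions(0.30, 0.375, 1000))          # comfortably above 0.9
```

With 50 patients per arm the trial has roughly a 12% chance of detecting the improvement, so a "negative" result says almost nothing; this is the sense in which such therapies "have not received a fair test."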
Article
P-values are a practical success but a critical failure. Scientists the world over use them, but scarcely a statistician can be found to defend them. Bayesians in particular find them ridiculous, but even the modern frequentist has little time for them. In this essay, I consider what, if anything, might be said in their favour.
Article
Bayes factors have been offered by Bayesians as alternatives to P values (or significance probabilities) for testing hypotheses and for quantifying the degree to which observed data support or conflict with a hypothesis. In an earlier article, Schervish showed how the interpretation of P values as measures of support suffers a certain logical flaw. In this article, we show how Bayes factors suffer that same flaw. We investigate the source of that problem and consider what are the appropriate interpretations of Bayes factors.
Article
Ronald Fisher advocated testing using p-values, Harold Jeffreys proposed use of objective posterior probabilities of hypotheses, and Jerzy Neyman recommended testing with fixed error probabilities. Each was quite critical of the other approaches. Most troubling for statistics and science is that the three approaches can lead to quite different practical conclusions. This article focuses on discussion of the conditional frequentist approach to testing, which is argued to provide the basis for a methodological unification of the approaches of Fisher, Jeffreys and Neyman. The idea is to follow Fisher in using p-values to define the “strength of evidence” in data and to follow his approach of conditioning on strength of evidence; then follow Neyman by computing Type I and Type II error probabilities, but do so conditional on the strength of evidence in the data. The resulting conditional frequentist error probabilities equal the objective posterior probabilities of the hypotheses advocated by Jeffreys.
Why Most of the Studies You Read About Are Wrong. Arts and Letters
  • J Bowyer
Methods of Statistics
  • F Y Edgeworth
Findings of Research Misconduct. Washington, DC: U.S. Department of Health and Human Services, Office of the Secretary
  • Office of Research Integrity
Making a Stat Less Significant. The Numbers Guy
  • C Bialik
Editorial Statement on Negative Findings
The Unprincipled Randomization Principle in Economics and Medicine. Oxford Handbook of Professional Economic Ethics
  • S Ziliak
  • E Teather-Posadas
Theories of Probability
  • H Jeffreys
Lady Justice v. Cult of Statistical Significance: Oomph-less Science and the New Rule of Law
  • S Ziliak
  • D McCloskey
  • M Lavine
Supreme Court of the United States
  • D McCloskey
  • S Ziliak