Article

The Case Against Statistical Significance Testing


Abstract

At present, too many research results in education are blatantly described as significant, when they are in fact trivially small and unimportant. There are several things researchers can do to minimize the importance of statistical significance testing and get articles published without using these tests. First, they can insert statistically in front of significant in research reports. Second, results can be interpreted before p values are reported. Third, effect sizes can be reported along with measures of sampling error. Fourth, replication can be built into the design. The touting of insignificant results as significant because they are statistically significant is not likely to change until researchers break the stranglehold that statistical significance testing has on journal editors.
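Carver's third recommendation, reporting effect sizes together with measures of sampling error, can be illustrated with a short sketch (the formulas are standard; the scores below are hypothetical):

```python
import math
import statistics

def cohens_d_with_se(group1, group2):
    """Standardized mean difference (Cohen's d) with an
    approximate large-sample standard error for d."""
    n1, n2 = len(group1), len(group2)
    # pooled standard deviation across the two groups
    sp = math.sqrt(((n1 - 1) * statistics.variance(group1) +
                    (n2 - 1) * statistics.variance(group2)) / (n1 + n2 - 2))
    d = (statistics.fmean(group1) - statistics.fmean(group2)) / sp
    # a common approximation for the sampling error of d
    se = math.sqrt((n1 + n2) / (n1 * n2) + d * d / (2 * (n1 + n2)))
    return d, se

# hypothetical achievement scores for two classrooms
treatment = [78, 82, 85, 74, 80, 83, 79, 81]
control   = [75, 79, 77, 72, 78, 76, 74, 80]

d, se = cohens_d_with_se(treatment, control)
```

Reporting "d = 1.27, SE = 0.55" conveys both the magnitude of the difference and its sampling uncertainty, which a bare p-value does not.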


... The main aim of the tutorial is to illustrate the bases of discord in the debate against NHST (Gigerenzer, 2004; Macdonald, 2002), which remains a problem not only as yet unresolved but very much ubiquitous in current data testing (e.g., Franco, Malhotra, and Simonovits, 2014) and teaching (e.g., Dancey and Reidy, 2014), especially in the biological sciences (Lovel, 2013; Ludbrook, 2013), social sciences (Frick, 1996), psychology (Gigerenzer, 2004; Nickerson, 2000) and education (Carver, 1978, 1993). ...
... The p-value comprises the probability of the observed results and also of any other more extreme results (e.g., the probability of the actual difference between groups and any other difference more extreme than that). Thus, the p-value is a cumulative probability rather than an exact point probability: It covers the probability area extending from the observed results towards the tail of the distribution (Carver, 1978; Fisher, 1960; Frick, 1996; Hubbard, 2004). ...
... Note: P-values provide information about the theoretical probability of the observed and more extreme results under a null hypothesis assumed to be true (Bakan, 1966; Fisher, 1960), or, said otherwise, the probability of the data given a true hypothesis, P(D|H) (Carver, 1978; Hubbard, 2004). As H0 is always true (i.e., it shows the theoretical random distribution of frequencies under certain parameters), it cannot, at the same time, be false nor falsifiable a posteriori. ...
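The point that a p-value is a cumulative tail area rather than a point probability can be made concrete with a minimal one-sided z-test sketch (all numbers hypothetical):

```python
import math

def normal_sf(z):
    # survival function of the standard normal: P(Z > z)
    return 0.5 * math.erfc(z / math.sqrt(2))

def one_sided_p(sample_mean, mu0, sigma, n):
    # p = P(result as extreme or MORE extreme than observed | H0):
    # the cumulative area from the observed z out into the tail
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    return normal_sf(z)

# hypothetical: observed mean 103 vs. H0 mean 100, sigma 15, n = 36
p = one_sided_p(103.0, 100.0, 15.0, 36)
```

Here p is roughly 0.115: the probability of a sample mean of 103 *or anything larger* under H0, not the probability of exactly 103.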
Preprint
Despite frequent calls for the overhaul of null hypothesis significance testing (NHST), this controversial procedure remains ubiquitous in behavioral, social and biomedical teaching and research. Little change seems possible once the procedure becomes well ingrained in the minds and current practice of researchers; thus, the optimal opportunity for such change is at the time the procedure is taught, be this at undergraduate or at postgraduate levels. This paper presents a tutorial for the teaching of data testing procedures, often referred to as hypothesis testing theories. The first procedure introduced is the approach to data testing followed by Fisher (tests of significance); the second is the approach followed by Neyman and Pearson (tests of acceptance); the final procedure is the incongruent combination of the previous two theories into the current approach (NHST). For those researchers sticking with the latter, two compromise solutions on how to improve NHST conclude the tutorial.
... If it is incorrect to interpret the p-value as the probability that H0 is false, it is equally incorrect, though also common, to interpret the complementary probability (1-p) as the probability that HA is true (Carver, 1978; Cohen, 1990; Gigerenzer et al., 2012; Kline, 2013; Lambdin, 2012; Nickerson, 2000). The belief held by many researchers is that the p-value is the probability of H0, so for them an acceptable probability, given the results, that H0 is true is 0.05 or less, and its complement, 0.95, is the probability that HA is true. ...
... The belief that the complement of the p-value is the probability that a result will be replicated likewise represents an incorrect interpretation of the information produced by NHST (Badenes-Ribera et al., 2015; Carver, 1978; Falk & Greenbaum, 1995; Gigerenzer et al., 2012; Kline, 2013; Nickerson, 2000; Sohn, 1998). Researchers who defend this conjecture interpret statistical significance at 0.05 to mean that in 95 out of every 100 experiments the observed difference will hold (Carver, 1978). For Carver (1978), nothing in the logic of statistics provides a basis for interpreting a statistically significant result as the probability that the result will replicate. ...
Preprint
Full-text available
Widely adopted in psychology in general, inferential statistics are also frequent in Behavior Analysis (BA), an approach that historically favored single-case experimental studies. The use of group research employing null hypothesis significance testing (NHST) in data analysis has grown in this area and brings with it related problems (both intrinsic and arising from misuse). Such problems often go unnoticed in the current peer-review system, compromising the reliability of some conclusions available in the scientific literature. In this article, we explain the problems related to the misuse and misinterpretation of NHST and compile guidelines for editors, reviewers, and authors that can be adopted to minimize these problems.
... (cf. Bakan, 1966; Berkson, 1942; Carver, 1978, 1993; Chow, 1998; Cohen, 1994; Dar, Serlin, & Omer, 1994; Hagen, 1997; Harlow, 1997; Hodges & Lehmann, 1954; Hogben, 1957; Hunter & Schmidt, 1990; Lykken, 1968; Meehl, 1967, 1978, 1990a, 1990b; Morrison & Henkel, 1970; Rozeboom, 1960; Sterling, 1959) and culminated in a special section of Psychological Science discussing whether the NHST should be banned (Abelson, 1997b; Estes, 1997b; Harris, 1997b; Hunter, 1997; Scarr, 1997; Shrout, 1997). The American Psychological Association Task Force on Statistical Inference was convened to determine what role, if any, NHST should have in psychological science (Schmidt, 1996; Wilkinson and the Task Force on Statistical Inference, 1999). ...
... Substantive and repeated efforts made over the past 75 years to remediate NHST misuses by discussing the technical merits and correct use of NHST have not been productive (e.g., Carver, 1978, 1993; Kaiser, 1960; Kish, 1959; Serlin & Lapsley, 1985). There is no reason to expect that further efforts along these lines will be useful. ...
... implies replicability of p > .95 on the premise that rejection of the null hypothesis implies that the results are not due to chance and that therefore they must be both systematic and reproducible. Carver (1978) identified these views as common misunderstandings of NHST. Significance testing does not provide evidence of reproducibility. ...
Article
Full-text available
Null hypothesis statistical testing (NHST) has been debated extensively but always successfully defended. The technical merits of NHST are not disputed in this article. The widespread misuse of NHST has created a human factors problem that this article intends to ameliorate. This article describes an integrated, alternative inferential confidence interval approach to testing for statistical difference, equivalence, and indeterminacy that is algebraically equivalent to standard NHST procedures and therefore exacts the same evidential standard. The combined numeric and graphic tests of statistical difference, equivalence, and indeterminacy are designed to avoid common interpretive problems associated with NHST procedures. Multiple comparisons, power, sample size, test reliability, effect size, and cause–effect ratio are discussed. A section on the proper interpretation of confidence intervals is followed by a decision rule summary and caveats.
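The three-way verdict described in the abstract can be illustrated with a generic confidence-interval sketch (this illustrates the general idea only, not Tryon's exact inferential confidence interval procedure; the equivalence margin and data are hypothetical):

```python
import math
import statistics

def mean_diff_ci(a, b, z=1.96):
    """Approximate 95% CI for the difference in means (large-sample z)."""
    diff = statistics.fmean(a) - statistics.fmean(b)
    se = math.sqrt(statistics.variance(a) / len(a) +
                   statistics.variance(b) / len(b))
    return diff - z * se, diff + z * se

def three_way_decision(a, b, margin):
    """One interval, three possible verdicts:
    CI entirely beyond +/- margin -> statistical difference;
    CI entirely inside +/- margin -> statistical equivalence;
    otherwise the data do not settle the question."""
    lo, hi = mean_diff_ci(a, b)
    if lo > margin or hi < -margin:
        return "statistical difference"
    if -margin <= lo and hi <= margin:
        return "statistical equivalence"
    return "indeterminacy"

# hypothetical measurements and an equivalence margin of 1.0 unit
a = [10.1, 10.3, 9.9, 10.2, 10.0, 10.1, 9.8, 10.2]
b = [10.0, 10.2, 10.1, 9.9, 10.1, 10.0, 10.3, 9.9]
verdict = three_way_decision(a, b, margin=1.0)
```

Unlike a bare reject/fail-to-reject outcome, this logic can positively support equivalence or admit that the data are inconclusive.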
... Its role in a significance testing procedure is limited either to rejecting the null hypothesis H0 or to failing to reject it. It says nothing about the alternative hypothesis H1 (Carver, 1978). A p-value is, first of all, a function of the true effect size, of sampling fluctuations, and of experimental measurement error (Carver, 1993). ...
... Yet the significance test does not make it possible to determine what part of a significant difference is due to the true effect size and what part is due to sampling fluctuations and experimental measurement error. Carver (1978) and Thompson (1993), among others, have thus pointed out that users of significance tests often erroneously interpret a p-value as a measure of effect size. This is evidenced by the following expressions in common statistical use: 'significant result', 'highly significant result' and 'very highly significant result' (Carver, 1978). ...
Article
Full-text available
An abundant literature of didactics research on statistical tests reports various difficulties, often linked to misconceptions, that can be encountered at any age and any level of expertise. These studies agree that, in general, an average student cannot describe the founding idea of this statistical tool; only a mechanical calculation is usually reproduced, based far more on memorization than on reflection and interpretation. Admittedly, many factors contribute to this difficulty. In this article we therefore focus on exploring how the ambiguous probabilistic language used in the standard expressions underlying decision-test procedures affects students' interpretation of the concepts conveyed. To this end, we selected a sample of 195 individuals and administered a questionnaire of five questions involving standard expressions used in decision tests. The results reveal several anomalies that can be taken into account to improve the learning of this statistical tool.
... Reliance on significance testing has been subject to methodologists' criticism over the years (Carver, 1978; Cohen, 1994; Guttman, 1985; Meehl, 1978; Oakes, 1986; Rozeboom, 1960). ...
... or .01). In addition, many researchers believe that significance tests provide benefits for interpreting data that they do not in fact provide (Carver, 1978; Oakes, 1986; Schmidt, 2016). Many researchers may conclude that when a result is non-significant, it is due merely to chance, which is another false belief. This is a widespread but mistaken belief about the usefulness of the information provided by significance tests (Carver, 1978; Oakes, 1986; Schmidt, 2016). ...
Thesis
Full-text available
The general aim of this study is to synthesize experimental studies on the effect of student-centered strategies, methods, and techniques used in primary school mathematics lessons on academic achievement in mathematics. Specific criteria were used for including primary studies in the synthesis. As a result of the search, 63 studies were included, with a total sample size of 4,835. The reliability of the coding protocol was established in two stages by calculating inter-coder reliability; agreement (AR) was computed as 0.88 and deemed sufficient (AR > .80). The validity of the study was addressed through checks for publication bias together with quality assessment of the primary studies, language bias, time-lag bias, and database bias; no bias was found. In the systematic-review part of the study, the data were subjected to descriptive analysis. The meta-analysis results indicate that student-centered strategies, methods, and techniques are more effective than traditional teaching methods. When the 66 effect sizes drawn from the 63 studies were analyzed under a random-effects model, the overall effect size was 0.787, which corresponds to a medium-to-large effect under various classification schemes. In the moderator analyses, no significant differences were found for duration of implementation, publication type, database, grade level, measurement instrument, school starting age, country, or sample size; only the implementation-approach moderator showed a significant difference.
... In a similar vein, concluding that "males and females significantly differ in their BMI scores" because the p-value associated with a mean (or other average-based) comparison is less than 0.05 is also very common but incorrect. This conclusion also stems from a dichotomous interpretation of p-values and, in this case, it falls in the so-called valid research hypothesis fallacy (Carver, 1978, 1993), which takes the rejection of the null hypothesis as evidence supporting the researcher's hypothesis. However, the researchers' hypothesis is not under evaluation, only the null hypothesis is, and it is known beforehand that the null hypothesis of no differences is false (though statistical power may not always be sufficient to demonstrate this; for an ampler discussion, see Cohen, 1994; Hirschauer et al., 2022; Tukey, 1991). ...
... For example, it is commonly assumed that, when p < 0.05, the probability of replicating the same result in future studies must be > 0.95 (the replicability fallacy; Carver, 1978), and that the factor tested is "the cause" of the observed effect (the causality fallacy; Kline, 2013). In terms of replicability, it is important to highlight that if the effect size in the entire population matches that observed in a sample where a t-test produces a p-value below 0.05, the likelihood of obtaining a p-value < 0.05 in a replication study is 0.5, not 0.95 (Greenwald et al., 1996). ...
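The 0.5-versus-0.95 replication point attributed to Greenwald et al. (1996) can be checked with a small Monte Carlo sketch (assumptions: a two-sided z-test, and a population effect fixed at exactly the value that just reaches p = .05 in the original sample):

```python
import math
import random

random.seed(1)
n = 30
# population effect chosen so the expected z equals the .05 cutoff
delta = 1.96 / math.sqrt(n)

replications = 20000
successes = 0
for _ in range(replications):
    xs = [random.gauss(delta, 1.0) for _ in range(n)]
    z = (sum(xs) / n) * math.sqrt(n)
    if z > 1.96:          # the replication is also "significant"
        successes += 1

replication_rate = successes / replications
```

Because the sampling distribution of z is centered exactly on the cutoff, about half of the replications land below it: the rate comes out near 0.5, not 0.95.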
... There has been extensive debate for many years about what has been called the crisis of statistical significance (Carver, 1978; Cohen, 1994, 1995; Thompson, 2004). For example, Kline (2004) has identified five fallacies about p-values ("The Big Five") and twelve further fallacies related to decisions about statistical significance. ...
... • The valid research hypothesis fallacy (Carver, 1978) is the false belief that 1-p is the probability that the alternative hypothesis is true. The quantity 1-p is a probability, but it is only the probability of obtaining a result even less extreme under H0 than the one actually found. ...
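The fallacy that 1-p gives the probability of the alternative hypothesis can be probed with a small simulation (all numbers here are hypothetical choices): among "significant" results, the share coming from a true null depends on power and on how often the null is true, not on 1-p.

```python
import math
import random

random.seed(7)
n, z_crit = 25, 1.96
true_effect = 0.3        # assumed standardized effect when H0 is false
experiments = 4000
significant = null_true_and_significant = 0

for _ in range(experiments):
    h0_true = random.random() < 0.5      # 50/50 prior over hypotheses
    delta = 0.0 if h0_true else true_effect
    xs = [random.gauss(delta, 1.0) for _ in range(n)]
    z = abs(sum(xs) / n) * math.sqrt(n)  # two-sided z-test
    if z > z_crit:
        significant += 1
        null_true_and_significant += h0_true

share_null = null_true_and_significant / significant
```

With this modest power, well over 5% of the significant results arise from a true null, so P(HA | significant) falls clearly short of the 0.95 the fallacy promises.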
Article
Full-text available
The main purpose of this article is to defend the position that some fallacies surrounding the application of the frequentist strategy, represented by the tradition of null hypothesis statistical testing, have been mistaken for substantive solutions in research. It also seeks to illustrate how this confusion contributes to the replication crisis (Costello & Watts, 2022) and to show how researchers, given their own limitations, must confront the challenges posed by all the statistical strategies within their reach. It warns of the high probability that these strategies are subject to the cognitive miserliness that characterizes human beings. Finally, the teaching of wisdom is offered as an alternative for reducing such biases.
... If everything matters, then what is to be done about the predominance of statistical significance in child welfare research as a marker for what is important? As Carver (1978) points out, a statistical tool is only as good as the user and the degree to which the user understands its utility and limitations. But is the use of statistics in child welfare moving beyond a mere tool toward accepted dogma within quantitative child welfare research, to the exclusion of other options such as network science? ...
... While Carver (1978) rejected statistical tests outright, he offered few useful alternatives. ...
Article
Full-text available
Several studies conducted over the past decade demonstrate a clear relationship between poverty and the risk of facing an intervention by Quebec's Direction de la protection de la jeunesse (DPJ). Although this association is common across North American jurisdictions, it is surprising given Quebec's relatively high level of progressive social policies aimed at reducing family poverty. While studies clearly show that family and neighbourhood poverty are linked to the risk of child-protection intervention, the mechanisms explaining this association remain unclear. The research question of the present study is grounded in issues of equitable service distribution. Building on previous province-wide Quebec studies, this study takes a geographic perspective on the relationship between poverty and child-protection involvement by analyzing the role of child population density across Quebec's regions. The results show (1) that child population density varies considerably across the province, and (2) that the linear relationship between substantiated child-protection reports, out-of-home placement, and poverty is stronger in regions with low population density. These findings raise further research questions about the role of services across geographic regions with respect to the risk of child-protection intervention for poor families and families in poor neighbourhoods. The article urges policymakers and researchers to consider the notion of spatial equity in service distribution in future public-policy analyses and research studies in Quebec.
... I wholeheartedly concede this point, but cannot avoid remarking that a major part of classical statistics, i.e., "hypothesis testing", is entirely based on precisely such blunt cut-off points. The absurdity of such crudely dichotomous criteria has been pointed out, among others, by Hogben (1957: 30); Selvin (1957); Rozeboom (1960); Morrison and Henkel (1970: 36 and 138-140); Tufte (1970: 439); Tukey and Wilk (1970: 338); Deutscher (1973: 202-203); Henkel (1976: 34-36 and 83-84); Carver (1978); Cohen (1994). The criticism has grown sharp in recent decades (see Kline 2004; Novella 2015; Amrhein and Greenland 2017; Denworth 2019). ...
... "Statistical sampling theory… suffers from the decisive deficiency of being monovariate… [However,] in the stage of analysis every sample-based survey is multivariate" (Harder 1969: 153). Also see criticism by Carver (1978), Quinn and Dunham (1983), Johnson (1999) and Stephens et al. (2005). Gini and Galvani's (1929) finding was, and still is, very frequently quoted in the relevant literature. ...
Article
Full-text available
Editor's note: In the social sciences literature the expressions 'random sample' and 'representative sample' are often used improperly and sometimes even interchangeably by students who seem to think that a sample is representative in so far, and because, it is random. In this essay I shall discuss the proper use of 'random', 'representative' and related terms, and I shall argue that no logical nexus exists between the two concepts, nor any causal relationship between the two sets of phenomena. The analysis begins with the term that raises the most annoying problems, and accordingly is less explored in the literature, viz. 'representativeness'.
... As the limitations of statistical significance, represented by the p-value, have come to be recognized, effect size addresses practical significance: the degree to which statistical significance obtained from a sample translates into practical terms (Howell, 2010; Kline, 2004; Thompson, 2006; Winer, 1971; Tolson, 1980; Kirk, 1982; Thomas & Nelson, 1990; Thomas, Salazar, & Landers, 1991; Cohen, 1997; Anderson, Burnham, & Thompson, 2000; Olejnik & Algina, 2003). The APA states that an effect size "should always" be reported when a p-value is reported (Clark, 1963; Bakan, 1966; Cronbach, 1975; Carver, 1978 ...
Article
Full-text available
As the domain of sport sociology research has diversified, many quantitative statistical methods have evolved across studies. However, only a narrow range of statistical methods has been used to understand social phenomena. Thus, based on a set of theoretical rationales, the present study examined the use of quantitative methods and statistical tools, and the reporting of population size, sample size, the resulting sampling error, and effect size. The results show that a good balance between quantitative and qualitative methods has been established in Korean sport sociology in recent years. Nonetheless, diverse statistical tools have not been utilized enough, and population size and sampling error have been poorly reported so far. Statistical assumptions have not been properly tested, and effect sizes tend not to be reported, so practical significance is not established. This indicates that inferential studies lack sufficient components for generalization.
... p < 0.05 and all that! Though widely deprecated [17,96,23], null hypothesis significance testing (NHST) persists as the dominant approach to statistical analysis. Hence, the following fictitious examples "smell bad": ...
Preprint
CONTEXT: There has been a rapid growth in the use of data analytics to underpin evidence-based software engineering. However, the combination of complex techniques, diverse reporting standards and poorly understood underlying phenomena is causing some concern as to the reliability of studies. OBJECTIVE: Our goal is to provide guidance for producers and consumers of software analytics studies (computational experiments and correlation studies). METHOD: We propose using "bad smells", i.e., surface indications of deeper problems, a notion popular in the agile software community, and consider how they may be manifest in software analytics studies. RESULTS: We list 12 "bad smells" in software analytics papers (and show their impact by examples). CONCLUSIONS: We believe the metaphor of bad smell is a useful device. Therefore we encourage more debate on what contributes to the validity of software analytics studies (so we expect our list will mature over time).
... It is worth noting that there is a long-standing debate on the case for and against hypothesis testing and statistical significance versus effect size [e.g., see Carver, 1993; Sawilowsky, 2003]. ...
Article
The use of tramlines or wheelings to carry out agricultural operations, such as spraying and fertilizer applications, is common across the world. They are often orientated up and down the slope, and the soil that is driven on becomes compacted because machinery weight transfers stress through the soil profile. This compaction leads to tramlines becoming conduits for water moving over the soil surface. Like water, sediment and phosphorus are also detached and transported. Reducing surface runoff and diffuse pollution losses associated with wheelings has received some research attention, but results are often difficult to interpret. This is because of the low number of replicates that are possible in agricultural landscapes if the research is to be conducted at meaningful scales and remain feasible. To address this, we utilize effect sizes and confidence intervals to analyse surface runoff and diffuse pollution data from a series of studies at five arable field sites in the UK, where surface runoff, sediment and phosphorus were collected from hillslope-scale tramline plots using the same methodology. In addition, we tested the impact of very flexible tyres, rotary harrows and a surface profiler roller on surface runoff and diffuse pollution losses. Although the monitoring period did not encompass widespread flood-inducing storms, we demonstrate that the magnitude of the sediment and total phosphorus (TP) losses from the tramline plots across the study sites is significant in the UK context. Annual sediment losses from the study plots are in the order of 0.5–4.5 Mg ha⁻¹ yr⁻¹ and consistent with the magnitudes of soil erosion in the UK. TP fluxes observed at the study plots ranged between 0.8 and 3.9 kg ha⁻¹ yr⁻¹, consistent with the TP losses reported for surface runoff from arable plots in the UK.
By utilizing effect size analysis, we demonstrate the significant impact of tramline mitigation on surface runoff and diffuse pollution losses. The rotary harrow performed best overall, and the combination of the rotary harrow and the very flexible tyre was superior to all other methods. This was the case for all treatments apart from some where the surface profiler performed well in reducing sediment fluxes. Our work supports the need for incorporating tramline management measures into soil management strategies for arable landscapes and provides evidence for policymakers developing measures for agri-environmental schemes.
... The tendency to drop the word statistically and use 'significant difference' instead of 'statistically significant difference' in research reports reflects a most common misconception that 'statistically significant means important' (Nassaji, 2012, p. 95). As scholars (e.g., Carver, 1993; Wei et al., 2020) have argued, the term 'statistically' must always precede the word 'significant'. It is highly recommended to use 'statistically' to modify '(non-)significant(ly)' wherever appropriate (Li, 2021). ...
Chapter
In the past three decades, there has been a surge in empirical studies exploring the benefits of Lx learning and bilingualism for individuals and society (Baker & Wright, 2021). However, there is limited knowledge about older adults (especially those aged 60 and above) learning an additional language, an emerging research area. This study synthesised 47 empirical research papers published between 1900 and 2022, revealing five major themes related to Lx learning in various countries. The synthesis highlighted non-cognitive benefits of Lx learning, such as positive language learning emotions, improved access to information, and subjective well-being. The study aims to inform stakeholders about the value of Lx learning and encourage multidisciplinary research and promotion of Lx learning and bilingualism.
... To test if this difference is significant, a paired t-test was performed. With a statistical significance level of 5% (α = 0.05), the common significance level used in educational research (Carver, 1978), the t-test results and normalized gain score (g) are presented in Table 15. Table 15 shows that there is a significant difference between the pre-test and post-test scores, as indicated by the p-value (p < 0.001), which provides enough evidence to reject the claim that the difference in pre-test and post-test scores is due to random chance. ...
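The paired t statistic and Hake's normalized gain used in analyses like the one above can be computed as follows (a minimal sketch with hypothetical scores, not the study's actual data):

```python
import math
import statistics

def paired_t(pre, post):
    """t statistic for paired scores (df = n - 1)."""
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    return statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

def normalized_gain(pre_mean, post_mean, max_score):
    """Hake's normalized gain: g = (post - pre) / (max - pre),
    the fraction of the available improvement actually achieved."""
    return (post_mean - pre_mean) / (max_score - pre_mean)

# hypothetical pre/post scores out of 20
pre  = [10, 12, 9, 11, 13]
post = [14, 15, 12, 13, 16]

t = paired_t(pre, post)
g = normalized_gain(statistics.fmean(pre), statistics.fmean(post), 20)
```

Reporting g alongside t is one way to pair statistical significance with a measure of practical improvement.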
Article
With the emergence of novel advanced technologies, innovations in educational research have allowed the integration of learning into game-based approaches. Because students view chemistry as a complicated subject, inadequate performance in the subject is often observed. Game-based materials have been shown to increase performance in their target subjects, and as such, this study aims to improve students' performance in thermochemistry using a game-based approach called "Thermika." The evaluation of the app's suitability shows that the game is suitable, with a mean rating of 4.58. Analysis of the pre-test and post-test scores from participants in either the individual or group-based application of the intervention showed a significant difference (p < 0.001) in their scores, favoring the post-test. This statistical significance is further supported by the approximately equal normalized gain score of 0.180 (g = 0.180) for both groups and the respective effect sizes of 0.802 and 0.708 for the individual and group-based interventions, which establishes the intervention's practical significance as a large effect. As implied by the evaluations and tests conducted to determine the statistical and practical significance of the intervention, "Thermika" proved to be a suitable and effective pedagogical tool for improving performance and understanding in thermochemistry.
... Carver emphasized the importance of replication, saying "Replication is the cornerstone of science," and, by testing the reliability of results obtained through replication, expressed confidence in resampling methods [8]. Thus, it can be said that the samples so generated yield reliable estimates that generalize to the population. ...
Article
Full-text available
Sampling is one of the most important stages of scientific research. Sampling is the process of randomly selecting a smaller unit from the population in a way that represents the population well. In other words, the aim of sampling is to minimize sampling error so as to obtain consistent and valid estimates about the population. There are many sampling methods falling under different categories. In recent years, with advancing technology, a number of disadvantages of the basic sampling methods have been observed, and resampling methods were developed to address them. Resampling methods provide statistical information by repeatedly reprocessing the sample data. With rapidly developing technology, these methods took their place in practice as computer-based methods in the 1990s and, for both parametric and non-parametric distributions, made it possible to go beyond the basic methods and perform sampling with and without replacement on larger data sets. In this study, using the jackknife and bootstrap resampling methods, the mean and confidence-interval values of (bootstrap) samples of size n (10, 20, 30, 40, 50, 60, 70, 80), assumed to be drawn from samples of 100 and 300 units from a population with mean 10, were examined.
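The bootstrap procedure described in the abstract can be sketched as a percentile confidence interval for the mean (the sample below is hypothetical, not the study's simulated data):

```python
import random
import statistics

def bootstrap_mean_ci(data, reps=5000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean: resample with
    replacement, collect the resampled means, take percentiles."""
    rng = random.Random(seed)
    means = sorted(statistics.fmean(rng.choices(data, k=len(data)))
                   for _ in range(reps))
    lo = means[int(reps * (alpha / 2))]
    hi = means[int(reps * (1 - alpha / 2)) - 1]
    return lo, hi

# hypothetical sample assumed drawn from a population with mean 10
sample = [8.2, 9.5, 10.1, 10.4, 11.3, 12.0, 9.0, 10.8, 9.9, 10.2]
lo, hi = bootstrap_mean_ci(sample)
```

The jackknife differs mainly in how it resamples: it recomputes the statistic leaving out one observation at a time rather than drawing with replacement.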
... Wei et al., in press), which could be called a bilingual advantage among the older emergent bilinguals. Scholars in fields such as education (e.g. Carver, 1993) and applied linguistics (e.g. Wei et al., 2020) have argued that the term 'statistically' must always precede the word 'significant'. ...
Article
With increasing attention paid to the effects of learning a foreign language (FL) on older adults in the currently ageing world, psychological individual difference (ID) variables (e.g. learning motivation) remain much under-investigated, compared with cognitive IDs. This exploratory study examined older adults’ English learning motivation in the Chinese context of English as a foreign language (EFL) by conducting a web-based survey ( n = 510) and semi-structured interviews ( n = 21). Results showed that (1) the selected sociobiographical variables influenced older adults’ English learning motivation to different degrees, among which education, use frequency of English and socioeconomic status (SES) emerged as very important predictors; and (2) four motivators for English learning by older adults emerged as traveling or visiting relatives abroad, keeping the brain in shape, supporting inter-generational communication, and having general interest in the target FL. As one of the first systematic attempts to explore English learning motivation among Chinese older adults, the present study (1) contributes to a further understanding of English learning motivation among older adults in the Chinese EFL context, and (2) provides pedagogical and policy implications for English language teaching targeting older adults.
... Table 1. Differing views on the practical application of null hypothesis significance testing. In favor: Levin (1993), Fritz (1995 and 1996), Greenwald et al. (1996), Abelson (1997), Cortina and Dunlap (1997), Hagen (1997). Detractors: Bakan (1966), Craig et al. (1976), Carver (1978 and 1993), Chow (1988), Thompson (1988, 1989, 1996, 1997, and 1999), Cohen (1990, 1994), Falk and Greenbaum (1995), Schmidt (1996), Manzano (1997), Nickerson. Table 1 shows a chronological analysis of the historical debate over the testing and verification of statistical hypotheses. These studies evidence the confusion, criticism, and controversy among researchers, who initially considered the report of the p value sufficient to reject or accept a hypothesis (Ioannidis, 2018). ...
Article
Full-text available
Hypothesis testing constitutes the most widely used method in scientific research to estimate the statistical significance of any finding. However, its use is now questionable when unaccompanied by other statistical criteria that make the credibility and reproducibility of studies possible. Given this situation, this study reviews how null hypothesis significance testing has been used and the recommendations made regarding the application of other complementary statistical criteria for interpreting results. The main controversy over using only the probability value to reject or accept a hypothesis is described. According to the reviewed literature, interpreting a non-significant value as proof of the absence of an effect, or a significant value as proof of its existence, is a frequent mistake among scientific researchers. It is suggested that researchers rigorously assess the data obtained and include in study reports other statistics, such as test power and the effect size of the intervention, to offer a complete interpretation and increase the quality of results. Specifically, editors of scientific journals are encouraged to consider the reporting of these statistics, where required, as part of the criteria for evaluating papers.
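The review's central complaint — that a p value alone can certify a practically trivial finding — is easy to demonstrate by simulation. In this hedged sketch, a true difference of only 0.05 standard deviations is made "statistically significant" simply by using very large groups; the sample sizes and seed are arbitrary choices, and the p value uses a normal approximation that is adequate at this n.

```python
import math
import random

def welch_t_p(x, y):
    """Welch's t statistic, a normal-approximation two-sided p value, and Cohen's d."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    t = (mx - my) / math.sqrt(vx / nx + vy / ny)
    # For the large n used here, the t distribution is close to normal.
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))
    d = (mx - my) / math.sqrt((vx + vy) / 2)  # Cohen's d for equal group sizes
    return t, p, d

rng = random.Random(1)
# A tiny true difference (0.05 SD) with 50,000 observations per group.
a = [rng.gauss(0.05, 1) for _ in range(50_000)]
b = [rng.gauss(0.00, 1) for _ in range(50_000)]
t, p, d = welch_t_p(a, b)
print(p < 0.05, round(d, 2))  # significant, yet the effect is negligible
```

The test comfortably clears the .05 threshold while the standardized effect stays near 0.05 — exactly the "significant but trivially small" pattern Carver warns about.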
... "Sign econometrics" is about stating the direction of the coefficient but not its size (McCloskey & Ziliak, 1996;Ziliak & McCloskey, 2004b). However, 'sign' is not economically significant unless the magnitude is large enough to matter, and statistical significance does not indicate whether it is large or small (Carver, 1993;Sullivan & Feinn, 2012;Wasserstein & Lazar, 2016;Ziliak, 2016). Low P values do not necessarily imply large or more important effects (Wasserstein & Lazar, 2016). ...
Article
A recent paper in Management Accounting Research (MAR) claimed that the validity of positivistic management accounting research (PMAR) has increased significantly during the last four decades. We argue that this is a misrepresentation of reality as the current crisis of irreproducible statistical findings is not addressed. The reliability and validity of statistical findings are under an increasing pressure due to the phenomenon of Questionable Research Practices (QRPs). It is a phenomenon argued to increase the ratio of false-positives through a distortion of the hypothetico-deductive method in favour of a researcher’s own hypothesis. This phenomenon is known to be widespread in the social sciences. We therefore conduct a meta-analysis on susceptibility of QRPs on the publication practices of PMAR, and our findings give rise to reasons for concern as there are indications of a publication practice that (unintentionally) incentivises the use of QRPs. It is therefore rational to assume that the ratio of false-positives is well-above the conventional five-per cent ratio. To break the bad equilibrium of QRPs, we suggest three different solutions and discuss their practical viability.
... Without statistical significance, some researchers argued that there was no need to report effect sizes (Sawilowsky, 2003;Sawilowsky & Yoon, 2002). However, a plethora of scholars have recommended that effect sizes should be reported and interpreted even in the absence of statistical significance, according to the specific context (Cahan, 2000;Carver, 1993;Cumming & Finch, 2001;Harlow et al., 1997;Henson & Smith, 2000;Roberts & Henson, 2003). Going further, Schmidt (1996) stated that effect size estimates and confidence intervals are preferable to significance values (e.g. ...
Article
The purpose of this study is to use the Monte Carlo method to identify the most precise and least biased effect size calculations under a variety of conditions. The results show little difference in the effect sizes obtained whether the mean difference or the trimmed mean difference is used as the denominator. Cohen's dA proves to be somewhat more biased but more precise across all the conditions in the Welch t test. It is worth noting that Hedges' g remains the same as Cohen's dP across all the conditions of the Welch t test. When group sample sizes are equal, no matter which population effect size formula is applied, Cohen's dA, Cohen's dP, and Hedges' g yield the same estimates in terms of the bias statistics.
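The relation between Cohen's d and Hedges' g mentioned in the abstract comes from a multiplicative small-sample correction, commonly approximated as J ≈ 1 − 3/(4·df − 1). A minimal sketch on invented data (the study's subscripted variants dA and dP are not reproduced here):

```python
import statistics

def cohens_d_pooled(x, y):
    """Cohen's d with the pooled standard deviation as denominator."""
    nx, ny = len(x), len(y)
    sp = (((nx - 1) * statistics.variance(x) + (ny - 1) * statistics.variance(y))
          / (nx + ny - 2)) ** 0.5
    return (statistics.mean(x) - statistics.mean(y)) / sp

def hedges_g(x, y):
    """Apply the small-sample bias correction J ≈ 1 - 3/(4*df - 1) to d."""
    df = len(x) + len(y) - 2
    return cohens_d_pooled(x, y) * (1 - 3 / (4 * df - 1))

x = [5.1, 4.8, 6.0, 5.5, 4.9, 5.7]
y = [4.2, 4.6, 4.1, 4.8, 4.4, 4.0]
d = cohens_d_pooled(x, y)
g = hedges_g(x, y)
print(round(d, 2), round(g, 2), g < d)
```

Because J → 1 as df grows, g converges to d in large samples, which is why the two coincide across many simulated conditions.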
... The present study is a replication and extension of this recent work. The study has replication features, an often neglected aspect of behavioral research (Carver, 1978), as it uses a different nursing home setting than previous studies. The study also extends prior work by employing a newly developed method of sequential analysis (Sackett, 1977). ...
Article
Full-text available
In an observational-operant design study with 17 staff members and 36 elderly nursing home residents, sequential observations of resident–staff interactions were recorded daily during morning care over a 23-day period. Results suggest that independent behavior in self-maintenance care is not maintained by staff behavior but perhaps by intrinsic reinforcers or reinforcing agents other than staff, whereas dependent behavior is directly maintained by staff reinforcement. (16 ref)
... When reporting results, we use p-values as one component that, jointly with the correlations, contributes to our understanding of the data [49], and following [21], we interpret 0.1 ≤ |r| < 0.2 as a small effect size, 0.2 ≤ |r| < 0.3 as a medium effect size, and |r| ≥ 0.3 as a large effect size³. Finally, we do not make threshold-based claims of statistical significance [12] and do not perform corrections for multiple testing, to avoid an overly stringent interpretation of study outcomes [10,40]. ...
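Read with the correlations restored, the excerpt's banding rule can be sketched directly; the data below are invented, and the thresholds are simply those quoted from [21].

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def effect_band(r):
    """Band |r| using the thresholds quoted in the excerpt."""
    r = abs(r)
    if r >= 0.3:
        return "large"
    if r >= 0.2:
        return "medium"
    if r >= 0.1:
        return "small"
    return "negligible"

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.0, 1.5, 3.1, 2.8, 3.0, 4.1, 3.7, 4.5]
r = pearson_r(x, y)
print(round(r, 2), effect_band(r))
```

The band is reported alongside, not instead of, the p value — matching the excerpt's refusal to make threshold-based significance claims.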
... Compounding this want of justifiable generalizability in the 47 studies, and in answer to question 3, was a notable absence of direct or systematic replication as a research strategy. As Carver (1978) points out, replication can be one solution to the limited generalizability found in much educational research. We were encouraged by the 23% of studies which were coded as extending findings. ...
Article
Full-text available
This study was conceived as a systematic replication of a content analysis of published science education research conducted by Horton et al. in 1993. As such, 47 research articles published in "Science Education" between 1988 and 1992 were examined. Also, this study further extended the findings of Shaver and Norton, and Wallen and Fraenkel, who conducted similar analyses of general and social studies research. One major objective in this analysis was to determine whether science education researchers routinely practice commonly recommended research procedures. In addition, reviewers were interested in whether direct or systematic replication, common practices in other disciplines, play significant roles in science education research. The method of analysis and a discussion of the results are included.
... First of all, the most problematic misunderstanding about the P-value is to misinterpret "the probability of observing the sample data under the condition that the null hypothesis is correct (true)" as "the probability of the null hypothesis being true given the observed sample statistic" (Carver, 1978;Nickerson, 2000). Since the value of P is calculated on the premise that the null hypothesis is true, it cannot be the probability that the null hypothesis is true. ...
Article
Full-text available
A testing method to identify statistically significant differences by comparing the significance level and the probability value based on the Null Hypothesis Significance Test (NHST) has been used in food research. However, problems with this testing method have been discussed. Several alternatives to the NHST and the P-value test methods have been proposed including lowering the P-value threshold and using confidence interval (CI), effect size, and Bayesian statistics. The CI estimates the extent of the effect or difference and determines the presence or absence of statistical significance. The effect size index determines the degree of effect difference and allows for the comparison of various statistical results. Bayesian statistics enable predictions to be made even when only a small amount of data is available. In conclusion, CI, effect size, and Bayesian statistics can complement or replace traditional statistical tests in food research by replacing the use of NHST and P-value.
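Of the alternatives the abstract lists, the Bayesian one is the simplest to show in miniature: with a conjugate Beta prior, even a small tasting panel yields a full posterior distribution rather than a lone p value. The prior and counts below are hypothetical.

```python
def beta_binomial_update(alpha_prior, beta_prior, successes, failures):
    """Conjugate update: Beta(a, b) prior + binomial data -> Beta posterior."""
    return alpha_prior + successes, beta_prior + failures

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Hypothetical: 9 of 12 tasters prefer the reformulated product.
a0, b0 = 1, 1  # uniform prior
a1, b1 = beta_binomial_update(a0, b0, successes=9, failures=3)
print((a1, b1), round(beta_mean(a1, b1), 3))
```

A credible interval could then be read off the Beta(10, 4) posterior, giving the interval-style summary the abstract recommends alongside effect sizes.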
... When it comes to chance, we usually examine and assess it using the p-value. If the p-value is 0.05 or less, we observe chance within the probability of 5% or less; a form of 'odds-against-chance' fantasy (Carver, 1978) or 'illusion of attaining improbability' (Falk and Greenbaum, 1995) Technically, we are saying that the null hypothesis is a proposition which shows that there is no relationship between the variables we are looking at, and even if there's any relationship observed, they happen by chance with a probability of 5% or less. For null hypotheses that are 'candle' in our context, we want to make sure that the differences arising from the question with obscurity effect and question post-effect are subjected to 5% or less of the probability of chance. ...
Thesis
Full-text available
Behavioral economics is a study of decision-making from the perspective of individual’s or institutions’ behavior arising from a departure from the classical economic theory. One example would be the differences in setting buying price and selling price for the same product due to the attachment one has with it. In an uncertain environment where events change rapidly, consumers are not able to indicate their buying price and selling price of the same product rationally, leading to a disparity in prices. And this disparity can be noticeably huge. Over the past 50 years, researchers in the field of psychology, decision science, and economics have studied the varied nature of the irrationality of consumers, and concluded that loss aversion remains the most credible explanation for the disparity. While buyers seek market valuation of products, sellers seek compensation for losing a product when it is sold. Researchers have named this phenomenon the endowment effect. In our research, we have shown that the endowment effect does not apply to a particular class of products - the time-sensitive and value-depreciating products or TSVD products. The selling price is determined by the loss in losing the chance to sell the TSVD products across different points in time, which is fundamentally expressed in the Loss Aversion Sensitivity function or LAS function. And we have shown that the selling price on average can be higher than the buying price on average when consumers’ field of decision-making is obscured. We call this effect the obscurity effect. We researched the obscurity effect using a quantitative survey questionnaire and tested 6 hypotheses in within-subject and between-subject designs using non-parametric and parametric statistical methods. We concluded that the TSVD products follow the LAS function, and the behavioral pattern was disrupted when the obscurity effect is observed. In our discussion, we have provided some explanations for the obscurity effect. 
These explanations include concepts coming from cognitive psychology, social psychology, and emotions. Finally, we presented two use cases whereby businesses can benefit from curtailing the obscurity effect. Definitions used in this paper are defined by mathematical logic and reasoning.
... However, significance testing alone does not validate a given perspective on the data; it is also influenced by sample size, past results, and study design. Nevertheless, these findings provide a broad, generalized perspective on the population (Carver, 1978;Johnson, 1999;Kaye, 1986). ...
Article
Full-text available
The Red panda (Ailurus fulgens) habitat has been providing several ecosystem services (ES) to the people; however, the differences in local stakeholders' perceptions and preferences of these ecosystem services based on differences in their location, caste, gender, age, and engagement in CFUG are still understudied. This study was conducted using telephone interviews with 120 households from 28 Red panda habitat districts in the Himalayan range of Nepal. Respondents were asked to: (1) prioritize and rank the environmental (regulatory ES), economic (provisional ES), social, cultural, and spiritual importance of the Red panda habitat areas; (2) identify and prioritize the provisional ES; and (3) share their perceptions about the current state of the forest and biodiversity in comparison to the past decade to assess the change in Red panda habitat condition. Key findings include: (1) gender, caste, location, and involvement in community forests had a significant influence on people's perception and preference for ecosystem services (p < 0.05); (2) overall, the environmental value of forests was significantly prioritized over the social, cultural, and economic values (p < 0.05); (3) provisional services such as fuelwood and fodder were significantly prioritized by Dalit and indigenous people and CFUG members, whereas timber was given the highest priority by the Brahmin and Chhetri caste groups (p < 0.05); and (4) forest cover, biodiversity, and forest condition have significantly improved in East Nepal over the past 10 years, while the reverse was true in West Nepal (p < 0.05). Information on the preferences of local communities could assist in planning, policymaking, and effective management of natural ecosystems and ecosystem services.
More importantly, the findings provide a better understanding of the nature-human interactions in the Red panda region and indicate that people from marginalized groups (ethnic communities, Dalit, and women) still rely on forests (community forests in many cases), and any consideration in future policies should take this into account.
... Within NHST, the results of an individual study are analyzed, and broad generalizations are drawn from the statistical indicators obtained. Prior studies, and indeed any data beyond the data set at hand, are not taken into account [Carver, 1978; Robinson, Wainer, 2001; Schneider, 2015]. At the level of theoretical discussion this is usually not the case, but at the level of statistics, hypothesis testing within NHST is carried out in a vacuum, without regard to already existing research. ...
Article
The article is devoted to a critical discussion of the problem of statistical inference and the methodology of null hypothesis testing. It examines the main shortcomings of this approach to the statistical evaluation of data. Several levels of criticism of null hypothesis testing are distinguished: the properly statistical level, concerning the procedures and assumptions behind this methodology; the level of the social consequences of this approach's dominance in statistics, which leads to errors in interpreting the results obtained; and, finally, the level of the relationship between statistical and substantive (psychological) analysis. The article then considers the main alternatives currently proposed for overcoming the problems caused by the use of null-hypothesis-testing methodology and gives them a critical assessment. A preliminary conclusion is drawn that changing how an isolated study is evaluated is, by itself, insufficient.
... There is significant debate and concern surrounding the use of significance testing and p-values, particularly in light of the replication crisis. Indeed, many scientists have argued that significance tests should be abandoned altogether (Anderson et al., 2000;Carver, 1993;Gill, 1999). This view has been echoed in management research, where some recommend we "let go of statistical significance once and for all" (van Witteloostuijn, 2020, p. 275), "escape the straightjacket of NHST" (Lockett et al., 2014, p. 870), "stop relying on NHSTs" (Schwab et al., 2011, p. 1106), or even that "[i]t would be better for journals to ban p-values as well" (Starbuck, 2016, p. 74). ...
Preprint
Full-text available
The use of fixed alpha levels in statistical testing is prevalent in management research, but can lead to Lindley's paradox in highly powered studies. In this article, we propose a sample size-adjusted alpha level approach that combines the benefits of both frequentist and Bayesian statistics, enabling strict hypothesis testing with known error rates while also quantifying the evidence for a hypothesis. We present an R-package that can be used to set the sample size-adjusted alpha level for generalized linear models, including linear regression, logistic regression, and Poisson regression. This approach can help researchers stop relying on mindless defaults and avoid situations where they reject the null hypothesis when the evidence in the test actually favors the null hypothesis, improving the accuracy and robustness of statistical analysis in management research.
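Lindley's paradox can be illustrated without the proposed R-package (whose actual adjustment rule is not reproduced here). The sketch uses the standard unit-information (BIC-style) approximation BF01 ≈ √n · exp(−z²/2) for a result fixed exactly at the p = .05 boundary:

```python
import math

def bf01_unit_information(z, n):
    """BIC / unit-information approximation to the Bayes factor for H0 over H1
    in a one-sample test: BF01 ≈ sqrt(n) * exp(-z**2 / 2)."""
    return math.sqrt(n) * math.exp(-z * z / 2)

# A result sitting exactly at the p = .05 boundary (z = 1.96):
for n in (50, 1_000, 100_000):
    print(n, round(bf01_unit_information(1.96, n), 1))
```

At n = 50 the evidence is roughly equivocal, while at n = 100,000 the same z = 1.96 favours the null by a factor of about 46 — precisely the situation a fixed alpha level ignores.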
... 32,33 Consequently, many statisticians argue against using p-value thresholds for hypothesis testing. [34][35][36][37][38][39] p-values should be part of the analysis of results, with the understanding that a low p-value may have many causes. It may be that the alternative hypothesis proposed is true, or it may mean one of many other alternative hypotheses is true. ...
Article
Flaws in experimental statistics are a major contributor to the poor reproducibility of animal experiments. Informed decisions about whether conclusions are justified requires clear reporting of experimental data and the statistical methods used to analyse them. When data are misinterpreted, manipulated or concealed to generate publications, it creates an illusion that chance observations are robust data which confirm the hypotheses presented. Attempts to reproduce and advance such observations can propagate large areas of irreproducible science. This hinders scientific progress, erodes public support for research, damages reputations and wastes resources. This review analyses and explains recommendations to improve use and reporting of statistics in animal experiments.
... Faced with these systematic problems there have been increasingly widespread calls for the wholesale abandonment of the significance testing approach (e.g. McShane et al., 2019;Amrhein and Greenland, 2018;Hunter, 1997;Carver, 1978). ...
Preprint
Full-text available
There is a well-known problem in Null Hypothesis Significance Testing: many statistically significant results fail to replicate in subsequent experiments. We show that this problem arises because standard `point-form null' significance tests consider only within-experiment but ignore between-experiment variation, and so systematically underestimate the degree of random variation in results. We give an extension to standard significance testing that addresses this problem by analysing both within- and between-experiment variation. This `distributional null' approach does not underestimate experimental variability and so is not overconfident in identifying significance; because this approach addresses between-experiment variation, it gives mathematically coherent estimates for the probability of replication of significant results. Using a large-scale replication dataset (the first `Many Labs' project), we show that many experimental results that appear statistically significant in standard tests are in fact consistent with random variation when both within- and between-experiment variation are taken into account in this approach. Further, grouping experiments in this dataset into `predictor-target' pairs we show that the predicted replication probabilities for target experiments produced in this approach (given predictor experiment results and the sample sizes of the two experiments) are strongly correlated with observed replication rates. Distributional null hypothesis testing thus gives researchers a statistical tool for identifying statistically significant and reliably replicable results.
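The mechanism the preprint describes — point-form nulls ignoring between-experiment variation — can be caricatured in a short simulation. This is not the authors' method: the heterogeneity SD (tau), sample sizes, and seed are invented, and the test is a simple z-style test.

```python
import math
import random

def z_test_p(sample, mu0=0.0):
    """Two-sided z-style p value for a sample mean against mu0."""
    n = len(sample)
    m = sum(sample) / n
    s = math.sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))
    z = (m - mu0) / (s / math.sqrt(n))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

rng = random.Random(7)
n_experiments, n_per, tau = 2000, 50, 0.15  # tau: between-experiment SD (assumed)

false_pos = 0
for _ in range(n_experiments):
    # Each lab's true effect drifts around zero by tau,
    # but the point null tests mu = 0 exactly.
    lab_effect = rng.gauss(0.0, tau)
    sample = [rng.gauss(lab_effect, 1.0) for _ in range(n_per)]
    false_pos += z_test_p(sample) < 0.05

print(round(false_pos / n_experiments, 3))  # well above the nominal 0.05
```

With tau = 0.15 the nominal 5% test rejects far more often than 5%, even though every lab's effect is drawn from a distribution centred on zero.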
... These procedures are frequently misunderstood (Carver, 1978;Kline, 2004;Meehl, 1978), with many interpreting p-values to denote the probability that a result occurred simply due to chance, or that a p-value of less than 0.05 represents substantial evidence in favour of the alternative hypothesis. These erroneous, but frequent, interpretations, as Dienlin et al. (2020) note, have led to the incorrect belief that statistical significance is representative of 'real' effects and that only statistically significant results should be published-a substantial publication bias (Rosenthal, 1979;Iyengar & Greenhouse, 1988;Harrison et al., 2014). ...
Preprint
Full-text available
Academic research, as a social system, functions as a market between research producers and research consumers. Conventionally, this market introduces information asymmetries between these market participants—research producers know far more about a study than research consumers. This asymmetry can erode trust in the credibility of findings. Transparency, through open science practices, can enable readers to evaluate the credibility of a claim and, by reducing this information asymmetry, promote confidence in our findings. This paper discusses factors that can undermine trust in the credibility of findings, introduces key open science practices that are broadly applicable for IS, considers the current discourses on these practices in IS, and systematically reviews the level of adoption of open science practices in IS as is evident in empirical, quantitative articles published in the Senior Scholars Basket of Eight Journals. This investigation provides a baseline for the adoption of open science practices in our discipline and gives insight into the extent to which claims in IS research can be assessed. The results of the evaluation suggest that, prior to recent reforms, there exists a substantial degree of information asymmetry and a serious neglect of transparency in IS research. Transparency is a necessary component for the continued growth of IS as a credible science and considerable reform is required.
... The SD is used as the denominator and therefore it is often referred to as a standardised ES, also known as Cohen's d ('d' for difference) (Cohen, 1988). Although many researchers preclude joint presentation of NHST and ES (Carver, 1993;Schmidt, 1996), together they do what neither could do alone (Thompson, 2000). For example, observing a large but non-significant effect and placing emphasis solely on the magnitude of the effect rather than on the statistical non-significance could result in a Type I error (Kaiser, 1970), where a false hypothesis is incorrectly accepted. ...
Thesis
Full-text available
The primary aim of this thesis was to evaluate the dietary intake, energy expenditure and energy balance of young professional male rugby league players across the season.
... The current research strives to tackle this need, specifically drawing upon the work by Badenes-Ribera and colleagues (2015), who identified the rates of four specific p-value fallacies, namely the inverse probability (IP;Shaver, 1993;Kirk, 1996), replication (R;Carver, 1978;Fidler, 2005;Kline, 2013), effect size (ES; Gliner, Vaske, & Morgan, 2001), and clinical or practical significance (CPS; Kirk, 1996). Their findings demonstrated that academic psychologists were particularly prone to the IP fallacy (93.8% error rate), a misconception that derives fundamentally from falsely assuming that one can draw conclusions about the probability of a theory or hypothesis, given sample data. ...
... Hypothesis testing, regardless of whether it is set within the confines of the Fisherian framework or the Neyman-Pearson decision theoretic framework or a combination of the two, is one of the most difficult topics for students and researchers to understand (Jones, Lipson, & Phillips, 1994). Many researchers (e.g., Carver, 1978;Cohen, 1994;Daniel, 1997;Falk & Greenbaum, 1995;Haller & Krauss, 2002;Hurlbert & Lombardi, 2009;Johnson, 1999;Mulaik, Raju, & Harshman, 1997;Nickerson, 2004;Schmidt, 1996) have documented persistent misconceptions related to hypothesis testing. These misconceptions include, but are not limited to: ...
Conference Paper
Hypothesis testing reasoning is recognized as a difficult area for students. Changing to a new paradigm for learning inference through computer intensive methods rather than mathematical methods is a pathway that may be more successful. To explore ways to improve students’ inferential reasoning at the Year 13 (last year of school) and Stage One university levels, our research group developed new learning trajectories and dynamic visualizations for the randomization method. In this paper we report on the findings from a pilot study including student learning outcomes and on the modifications we intend to make before the main study. We discuss how the randomization method using dynamic visualizations clarifies concepts underpinning inferential reasoning and why the nature of the argument still remains a challenge.
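The randomization method at the heart of the learning trajectory reduces to a permutation test: re-shuffle the group labels many times and see how often the re-labelled difference is at least as large as the observed one. The data and shuffle count below are invented.

```python
import random

def permutation_p(group_a, group_b, n_shuffles=10_000, seed=3):
    """One-tailed permutation p value for mean(group_a) - mean(group_b)."""
    rng = random.Random(seed)
    observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    pooled = group_a + group_b
    count = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # break any real group structure
        re_a, re_b = pooled[:len(group_a)], pooled[len(group_a):]
        diff = sum(re_a) / len(re_a) - sum(re_b) / len(re_b)
        count += diff >= observed
    return count / n_shuffles

treatment = [23, 27, 31, 25, 29, 30, 26, 28]
control = [21, 24, 22, 26, 23, 25, 20, 24]
print(permutation_p(treatment, control))
```

Because the reference distribution is built from the data themselves, students can see the p value emerge as a simple proportion of shuffles rather than as an area under a theoretical curve.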
... So what makes such ideas overstay their welcome? Statistical significance testing, for instance, is an extensively used and beneficial method that is now viewed less favorably due to several adverse effects of its practical application in science (Carver, 1978, 1993;Johnson, 1999;Brockman, 2015). The key arguments made are not that significance testing is wrong, but that effect size is often not considered. (¹ Based on Planck, 1950.) ...
Article
Full-text available
In 2015, John Brockman edited a volume of chapters contributed by leading thinkers from various domains discussing common scientific ideas hindering further scientific progress. While it starts with the provocative slogan This Idea Must Die, the book's chapters and their authors for the most part do not argue that these existing, often foundational, scientific theories from various domains are false, but instead that their widespread, and often unquestioned, utilization has started to hinder the evolution of new theories. Through this work, we would like to foster a similar discussion in our community by suggesting six ideas in GIScience/geoinformatics that may benefit from being retired to make room for new perspectives. Our suggestions are somewhat controversial, and readers are encouraged to keep an open mind.
Article
Full-text available
In the debate about the merits or demerits of null hypothesis significance testing (NHST), authorities on both sides assume that the p value that a researcher computes is based on the null hypothesis or test hypothesis. If the assumption is true, it suggests that there are proper uses for NHST, such as distinguishing between competing directional hypotheses. And once it is admitted that there are proper uses for NHST, it makes sense to educate substantive researchers about how to use NHST properly and avoid using it improperly. From this perspective, the conclusion would be that researchers in the business and social sciences could benefit from better education pertaining to NHST. In contrast, my goal is to demonstrate that the p value that a researcher computes is not based on a hypothesis, but on a model in which the hypothesis is embedded. In turn, the distinction between hypotheses and models indicates that NHST cannot soundly be used to distinguish between competing directional hypotheses or to draw any conclusions about directional hypotheses whatsoever. Therefore, it is not clear that better education is likely to prove satisfactory. It is the temptation issue, not the education issue, that deserves to be in the forefront of NHST discussions.
Article
Full-text available
The discussions of hypothesis testing and the p value are enduring. They are, however, done in relative isolation; theorists inspect them using mathematical arguments, and applied scientists scrutinize them via experimental intuition. Most are aware that the interpretation of the p value needs to be contextual, but what does ‘‘contextual’’ mean? Here, linking examples and equations, we present a relatively comprehensive inquiry into the foundations, merits, and challenges of hypothesis testing and the p value: why they are useful and when negligence may occur. We build presentations from relatively simple history, philosophy, and cases to slightly complex statistical reasoning. We endeavor to make our language accessible and stories complementary to a broad audience; some apply those instruments frequently, some aspire to develop new methods, and perhaps all hope to one day find a cogent way to translate patterns from data into knowledge.
Article
Full-text available
In management research, fixed alpha levels in statistical testing are ubiquitous. However, in highly powered studies, they can lead to Lindley’s paradox, a situation where the null hypothesis is rejected despite evidence in the test actually supporting it. We propose a sample-size-dependent alpha level that combines the benefits of both frequentist and Bayesian statistics, enabling strict hypothesis testing with known error rates while also quantifying the evidence for a hypothesis. We offer actionable guidelines of how to implement the sample-size-dependent alpha in practice and provide an R-package and web app to implement our method for regression models. By using this approach, researchers can avoid mindless defaults and instead justify alpha as a function of sample size, thus improving the reliability of statistical analysis in management research.
Article
Full-text available
Digital game-based learning (DGBL) has the potential to promote equity in K-12 STEM education. However, few teachers have expertise in DGBL, and few professional development models exist to support teachers in both acquiring this expertise and advancing equity. To support the development of such models, we conducted a professional development program to explore teacher acquisition of technological, pedagogical, and content knowledge for games (TPACK-G) during a DGBL workshop series informed by culturally relevant pedagogy. This mixed methods pilot study used pre- and post-surveys and interviews to investigate shifts in teachers' (n = 9) TPACK-G, perceptions of DGBL, and operationalizations of equity and cultural relevance. The survey findings showed increases in teachers' TPACK-G, and corroboration between the surveys and interviews showed teachers' expanded ideas about the range of applications of digital games in STEM education. However, the interviews revealed that teachers' conceptualizations of equity and cultural relevance varied considerably.
Article
The effective management of age groups in the workplace is a recurring issue in organizational research and practice. With the entry of a young workforce into organizations and the resulting diversity of generations, questions have arisen about the extent and nature of personality and behavioral differences among people of different generations. The purpose of this research is to investigate the personality differences of people born in the 50s, 60s, 70s, and 80s and the implications for human resource management. Using the 240-item NEO personality test, the required data were collected from a large nationwide sample of 36,719 people. Mean differences in the five personality traits between generations were examined using one-way analysis of variance, Scheffé's post hoc test, and effect sizes. The results indicated that despite the presence of statistically significant differences, the generational groups do not differ meaningfully in terms of practical significance and effect size; the observed differences probably derive mainly from age rather than from generational membership. Therefore, instead of relying on generational stereotypes, managers and human resource specialists are advised to focus on individual differences.
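The pattern this abstract reports — statistically significant but practically negligible differences — is easy to reproduce in a sketch (simulated, hypothetical data, not the study's NEO sample): with very large groups, even a trivial true difference yields a minuscule p-value while the effect size stays near zero.

```python
import math
import random
import statistics

random.seed(0)

# Two very large groups whose true means differ by only 0.05 SD
# (a trivial effect; the numbers are hypothetical).
n = 50_000
a = [random.gauss(0.00, 1.0) for _ in range(n)]
b = [random.gauss(0.05, 1.0) for _ in range(n)]

diff = statistics.fmean(b) - statistics.fmean(a)
pooled_sd = math.sqrt((statistics.variance(a) + statistics.variance(b)) / 2)
d = diff / pooled_sd                       # Cohen's d: practical size of the effect
z = diff / (pooled_sd * math.sqrt(2 / n))  # large-sample test statistic
p = math.erfc(abs(z) / math.sqrt(2))       # two-sided p-value

print(f"p = {p:.2e} (statistically significant), d = {d:.3f} (trivially small)")
```

The test rejects decisively, yet the standardized difference is far below any conventional threshold for a "small" effect — which is why the abstract leans on effect size rather than p-values.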
Article
This article continues the theme of the authors' previous publication in the February issue of Psikhologicheskie Issledovaniya (Psychological Studies) for the current year (Vol. 9, No. 45), which criticized the currently widespread procedure of Null Hypothesis Statistical Testing. In this second part of the article, we argue that the problem must be addressed at the level of the social organization of science, and we describe some positive trends at this level in the contemporary scientific community: (1) the possibility and desirability of publishing all results, regardless of whether they turned out to be statistically significant; (2) the possibility and desirability of preregistering planned studies; (3) presenting results in a form convenient for meta-analysis; (4) an emphasis on conclusions supported by meta-analysis rather than by single experiments.
Article
In illicit online markets, actors are pseudonymous, legal institutions are absent, and predation is rife. The literature proposes that problems of trust are solved by reputation systems, social ties, and administrative governance, but these are often measured independently or in single platforms. This study takes an eclectic approach, conceiving of trust as an estimate informed by any available evidence. Using transaction size as a proxy for trust, I estimate the association between competing sources of trust – mediation, reputation, authentication, and social ties – and transaction value using multilevel regression. Using data from two online drug markets, I find mixed evidence that reputation and authentication are associated with transaction value, whereas results are consistent for social ties. Furthermore, transactions outside the scope of administrative mediation are generally larger. These findings have implications for future research and suggest increased attention should be given to the role of mediation practices and social ties.
Chapter
Research is increasingly conceived in terms of its results, not least because of the upheavals in the science system. This volume, however, directs the focus to the processes that make research results possible in the first place and give science its contours. The title Doing Research is to be understood as a reminder that research activity is shaped by specific positionings, partial perspectives, and exploratory movements. All contributors reflexively engage with their own research practices. The starting points are abbreviations, the supposedly smallest units of scholarly negotiation and communication. Anchored in education, the social sciences, media studies, and art studies, the volume paints a multidimensional picture of contemporary research, with transdisciplinary points of connection between digitality and education.
Article
Purpose: The purpose of this article is to explain the key reasons for the existence of statistics in doctoral-level research; why and when statistical techniques are to be used; how to statistically describe the units of analysis/samples and the data collected from them; how to statistically discover the relationship between the variables of the research question; a step-by-step process for statistical significance/hypothesis testing; tricks for selecting an appropriate statistical significance test; and, most importantly, which free software is the most user-friendly for carrying out statistical analyses. In turn, it guides Ph.D. scholars in choosing appropriate statistical techniques across the various stages of the doctoral-level research process to ensure high-quality research output. Design/Methodology/Approach: Postmodernism philosophical paradigm; inductive research approach; observation data collection method; longitudinal data collection time frame; qualitative data analysis. Findings/Result: Ph.D. scholars will be able to choose appropriate statistical techniques on their own across the various steps of the doctoral-level research process, and confidently defend their research findings, as long as they understand (i) that they need not be experts in mathematics/statistics and that statistics is easy to learn during the Ph.D.; (ii) the difference between measures of central tendency and dispersion; (iii) the difference between association, correlation, and causation; (iv) the difference between null and research/alternate hypotheses; (v) the difference between Type I and Type II errors; (vi) the key drivers for choosing a statistical significance test; and (vii) which software is best for carrying out statistical analyses. Originality/Value: There is a vast literature about statistics, probability theory, measures of central tendency and dispersion, formulas for finding the relationship between variables, and statistical significance tests.
However, only a few works explain them together comprehensively in a way that is comprehensible to Ph.D. scholars. In this article, we have attempted to explain the reasons for the existence, objectives, purposes, and essence of 'Statistics' briefly and comprehensively, with simple examples and tricks that should eradicate the fear of 'Statistics' among Ph.D. scholars. Keywords: Research Methodology; Research Design; PhD; Ph.D.; Coursework; Doctoral Research; Statistics; Statistical Techniques; JASP; Measures of Central Tendency; Measures of Dispersion; Mean; Median; Mode; Skewness; Kurtosis; Range; Standard Deviation; Coefficient of Variation; Type 1 Error; Type 2 Error; Significance Level; Alpha; Beta; Null Hypothesis; Research Hypothesis; Alternate Hypothesis; Hypothesis Testing; Significance Testing; Statistical Significance; Descriptive Statistics; Inferential Statistics; Parametric Test; Non-parametric Test; Normal Distribution; Bell Curve; Postmodernism
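As a small illustration of the Type I error concept listed above (a hedged sketch with simulated data, not part of the article): when the null hypothesis is true, a test at level alpha = .05 rejects in roughly 5% of experiments, by construction.

```python
import math
import random

random.seed(1)

def one_sample_p(xs):
    """Large-sample two-sided p-value for H0: mean = 0, with sigma known to be 1."""
    n = len(xs)
    z = (sum(xs) / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

# Simulate 2000 experiments in which the null hypothesis is TRUE.
alpha, trials, n = 0.05, 2000, 30
false_alarms = sum(
    one_sample_p([random.gauss(0, 1) for _ in range(n)]) < alpha
    for _ in range(trials)
)
rate = false_alarms / trials
print(f"Type I error rate: {rate:.3f} (should be close to alpha = {alpha})")
```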
Chapter
Internationally, efforts promoting greater transparency and improved management strategies for conflicts of interest (COI) have gained traction in healthcare settings. This particularly pertains to the development and use of clinical practice guidelines (CPG). Mounting evidence indicates that pharmaceutical industry payments to CPG authors and developers influence clinical recommendations, including drug selection, often to benefit commercial interests and at the expense of patients. To prevent undue influence of COI and develop trustworthy CPG, authors and developing organizations should establish strict COI management policies, including full disclosure. Such policies should include details about the monetary values and funding sources of all payments and gifts from pharmaceutical companies. Authors and developers should refuse any payments or gifts while drafting CPG. CPG developers should establish clear and comprehensive COI definitions and create monitoring committees that implement COI policies, promote external review, and track COI declared by CPG authors using existing payment databases.
Chapter
The spread of misinformation and disinformation related to science and technology has impeded public and policy efforts to mitigate threats such as COVID-19 and anthropogenic climate change. In the digital age, such so-called fake science can propagate faster and capture the public imagination to a greater extent than accurate science. Therefore, ensuring the most reliable science reaches and is accepted by audiences now entails understanding the origins of fake science so that effective measures can be operationalized to recognize misinformation and inhibit its spread. In this chapter, we review the potential weaknesses of science publishing and assessment as an origin of misinformation; the interplay between science, the media, and society; and the limitations of literacy as an inoculation against misinformation; and we offer guidance on the most effective ways to frame science to engage non-expert audiences. We conclude by offering avenues for future science communication research.
Poster
Full-text available
Student Perceptions of the Fifth Annual Greenhand Leadership Conference at Auburn University
Article
Full-text available
This article examines current research methodology in psychology in the context of Serlin and Lapsley's response to Meehl's critiques of the scientific practices of psychologists. The argument is made that Serlin and Lapsley's appeal to Lakatos's philosophy of science to defend the rationality of null hypothesis tests and related practices misrepresents that philosophy. It is demonstrated that Lakatos in fact considered psychology an extremely poor science lacking true research programs, an opinion very much in line with Meehl's critique. The present essay speculates on the reasons for Lakatos's negative opinion and reexamines the role of null hypothesis tests in relation to the quality of theories in psychology. It is concluded that null hypothesis tests are destructive to theory building and directly related to Meehl's observation of slow progress in soft psychology.
Article
Full-text available
Discusses editorial policy of the Journal of Educational Psychology with respect to substantive, procedural, and ethical issues. Research should make substantive contributions and conceptually integrate recent developments bearing on the topic. Procedural concerns involve terminological clarity, confounding and controlled factors, the effects of a laboratory orientation on classroom research, and the appropriateness of a statistical analysis. Ethical issues include piecemeal and duplicate publications, plagiarism, and falsification/fabrication of data. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Full-text available
Recent developments in procedures for conducting pairwise multiple comparisons of means prompted an empirical investigation of several competing techniques. Monte Carlo results revealed that the newer multistage sequential procedures maintain their familywise Type I error probabilities while exhibiting power that is superior to the traditional competitors. Of all procedures examined, the modified E. Peritz (1970) procedure (M. A. Seaman et al, 1990) is generally the most powerful according to all definitions of power. At the same time, when computational ease and convenience are taken into consideration, A. J. Hayter's (1986) procedure should be regarded as a viable alternative. Beyond pairwise comparisons of means, the versatile S. Holm (1979) procedure and its modifications (J. P. Shaffer, 1986) are very attractive insofar as they represent simple, yet powerful, data-analytic tools for behavioral researchers.
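The Holm (1979) step-down procedure praised above is simple enough to sketch in a few lines (an illustrative implementation, not the authors' simulation code): sort the p-values ascending, compare the i-th smallest to alpha/(m - i), and stop at the first non-rejection.

```python
def holm(p_values, alpha=0.05):
    """Holm (1979) step-down procedure: returns a reject/retain flag per hypothesis.

    Controls the familywise Type I error rate at alpha while being uniformly
    more powerful than the plain Bonferroni correction.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - step):
            reject[idx] = True
        else:
            break  # once one hypothesis is retained, all larger p-values are too
    return reject

# Four hypothetical p-values from pairwise comparisons:
print(holm([0.001, 0.04, 0.03, 0.02]))  # → [True, False, False, False]
```

Only the smallest p-value survives here: 0.001 beats 0.05/4, but 0.02 fails against 0.05/3, so the step-down stops.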
Article
Full-text available
The multiple-comparison procedure originally proposed by R. A. Fisher (1935) for the 1-way ANOVA context has several desirable properties when K (the number of groups) is equal to 3. In this article, the logic of the procedure is described in conjunction with those properties. A discussion follows of how the Fisher procedure can be similarly applied in a number of other K = 3 (and, more generally, 2-degree-of-freedom) hypothesis-testing situations. Finally, the Fisher logic is combined with recent sequential applications of the Bonferroni inequality to illustrate the utility and versatility of that combination for the applied researcher.
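A minimal sketch of Fisher's protected procedure for K = 3 (illustrative code with hypothetical data; in practice the critical values would come from F and t tables): run the omnibus one-way ANOVA first, and perform the pairwise tests only if it rejects.

```python
import math

def fisher_lsd(groups, f_crit, t_crit):
    """Fisher's protected LSD, sketched for K = 3 equal-size groups.

    Step 1: omnibus one-way ANOVA F test at level alpha (f_crit from tables).
    Step 2: only if the omnibus test rejects, run each pairwise t test,
            also at level alpha (t_crit from tables).
    """
    k = len(groups)
    ns = [len(g) for g in groups]
    means = [sum(g) / len(g) for g in groups]
    grand = sum(sum(g) for g in groups) / sum(ns)
    ssb = sum(n * (m - grand) ** 2 for n, m in zip(ns, means))
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    msw = ssw / (sum(ns) - k)
    f = (ssb / (k - 1)) / msw
    if f < f_crit:
        return f, []  # omnibus test retained: stop, no pairwise tests
    pairs = []
    for i in range(k):
        for j in range(i + 1, k):
            t = (means[i] - means[j]) / math.sqrt(msw * (1 / ns[i] + 1 / ns[j]))
            pairs.append(((i, j), abs(t) > t_crit))
    return f, pairs

# Hypothetical data; critical values for alpha = .05 with df = (2, 6) and df = 6.
f, pairs = fisher_lsd([[1, 2, 3], [2, 3, 4], [10, 11, 12]], f_crit=5.14, t_crit=2.447)
```

With these data the omnibus F is large, and the protected pairwise tests then separate the third group from the first two while retaining the first-versus-second comparison.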
Chapter
This chapter provides me with the opportunity to discuss a number of methodological and statistical “bugs” that I have detected creeping into psychological research in general, and into research on children’s learning in particular. Naturally, one cannot hope to exterminate all such bugs with but a single essay. Rather, it is hoped that this chapter will leave a trail of pellets that is sufficiently odorific to get to the source of these potentially destructive little creatures. It also goes without saying that different people in this trade have different entomological lists that they would like to see presented. Although all cannot be presented here, I intend to introduce you to nearly 20 of my own personal favorites. At the same time, it must be stated at the outset that present space limitations do not permit a complete specification and resolution of the problems that these omnipresent bugs can create for cognitive-developmental researchers. Consequently, in most cases I will only allude to a problem and its potential remedies, placing the motivation for additional inquiry squarely in the lap of the curious reader.
Article
Magnitude-of-effect (ME) statistics, when adequately understood and correctly used, are important aids for researchers who do not want to place a sole reliance on tests of statistical significance in substantive result interpretation. We describe why methodologists encourage the use of ME indices as interpretation aids and discuss different types of ME estimates. We discuss correction formulas developed to attenuate statistical bias in ME estimates and illustrate the effect these formulas have on different sample and effect sizes. Finally, we discuss several cautions against the indiscriminate use of these statistics and offer reasons why ME statistics, like all substantive result interpretation aids, are useful only when their strengths and limitations are understood by researchers.
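One widely used correction of the kind discussed here shrinks the positively biased eta squared into omega squared; a small sketch with hypothetical one-way ANOVA summary numbers:

```python
def eta_squared(ss_between, ss_total):
    """Uncorrected magnitude-of-effect estimate (positively biased in small samples)."""
    return ss_between / ss_total

def omega_squared(ss_between, ss_total, df_between, ms_within):
    """Bias-corrected magnitude-of-effect estimate; always smaller than eta squared."""
    return (ss_between - df_between * ms_within) / (ss_total + ms_within)

# Hypothetical one-way ANOVA summary: 3 groups, N = 30.
ssb, ssw = 20.0, 80.0
sst = ssb + ssw
msw = ssw / (30 - 3)
eta = eta_squared(ssb, sst)              # 0.20
omega = omega_squared(ssb, sst, 2, msw)  # about 0.14: the correction shrinks the estimate
```

The gap between the two estimates is exactly the attenuation of statistical bias the abstract describes; it widens as samples get smaller and effects get weaker.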
Article
Three of the various criticisms of conventional uses of statistical significance testing are elaborated. Three alternatives for augmenting statistical significance tests in interpreting results are then elaborated. These include emphasizing effect sizes, evaluating statistical significance tests in a sample size context, and evaluating result replicability. Ways of estimating result replicability from data in hand include crossvalidation, jackknife, and bootstrap logics. The bootstrap is explored in some detail.
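The percentile bootstrap explored in the article can be sketched in a few lines of standard-library Python (hypothetical data; the function `bootstrap_ci` and its defaults are illustrative, not Thompson's code): resample the data with replacement many times, recompute the statistic each time, and read a confidence interval off the percentiles of the resampled statistics.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.fmean, reps=5000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for any sample statistic."""
    rng = random.Random(seed)
    n = len(data)
    boot = sorted(stat([rng.choice(data) for _ in range(n)]) for _ in range(reps))
    lo = boot[int((alpha / 2) * reps)]
    hi = boot[int((1 - alpha / 2) * reps) - 1]
    return lo, hi

# Hypothetical sample of 20 effect-size estimates:
sample = [0.1, 0.4, 0.2, 0.5, 0.3, 0.6, 0.2, 0.4, 0.3, 0.5,
          0.1, 0.3, 0.4, 0.2, 0.6, 0.5, 0.3, 0.4, 0.2, 0.3]
lo, hi = bootstrap_ci(sample)
print(f"mean = {statistics.fmean(sample):.3f}, 95% bootstrap CI = ({lo:.3f}, {hi:.3f})")
```

The interval quantifies sampling error directly, which is the role the article assigns to the bootstrap as an augmentation of (not a replacement for) substantive interpretation.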
Article
Textbook discussion of statistical testing is the topic of interest. Some 28 books published from 1910 to 1949, 19 books published from 1990 to 1992, plus five multiple-edition books were reviewed in terms of presentations of statistical testing. It was of interest to discover textbook coverage of the P-value (i.e., Fisher) and fixed-alpha (i.e., Neyman-Pearson) approaches to statistical testing. Also of interest in the review were some issues and concerns related to the practice and teaching of statistical testing: (a) levels of significance, (b) importance of effects, (c) statistical power and sample size, and (d) multiple testing. It is concluded that it is not statistical testing itself that is at fault; rather, some of the textbook presentation, teaching practices, and journal editorial reviewing may be questioned.
Article
Based on principles of modern philosophy of science, it can be concluded that it is the magnitude of a population effect that is the essential quantity to examine in determining support or lack of support for a theoretical prediction. To test for theoretical support, the corresponding statistical null hypothesis must be derived from the theoretical prediction, which means that we must specify and test a range null hypothesis. Similarly, confidence intervals based on range null hypotheses are required. Certain of the newer multiple comparison procedures are discussed in terms of their applicability to the problem of generating confidence intervals based on range null hypotheses to control the familywise Type I error rate in multiple-sample experiments.
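The proposal to test a range null hypothesis via confidence intervals can be sketched as a simple decision rule (an illustration assuming a symmetric "negligible" range (-delta, delta); the function name and numbers are hypothetical):

```python
def range_null_decision(ci, delta):
    """Test H0: |effect| <= delta against H1: |effect| > delta using a confidence interval.

    The theoretical prediction is supported only when the whole interval lies
    beyond the negligible range (-delta, delta); it is contradicted when the
    interval lies entirely inside that range; otherwise the data are inconclusive.
    """
    lo, hi = ci
    if lo > delta or hi < -delta:
        return "support"        # entire CI outside the range null
    if -delta < lo and hi < delta:
        return "contradict"     # entire CI inside: the effect is negligibly small
    return "inconclusive"

# Hypothetical 95% CIs for a standardized effect, with delta = 0.10:
print(range_null_decision((0.25, 0.60), 0.10))   # → support
print(range_null_decision((-0.04, 0.06), 0.10))  # → contradict
print(range_null_decision((0.02, 0.30), 0.10))   # → inconclusive
```

Unlike a point null, this rule can return "contradict" — it lets sufficiently precise data count *against* a theoretical prediction, which is the property the abstract argues modern philosophy of science requires.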
Article
A test of statistical significance addresses the question: How likely is a result, assuming the null hypothesis to be true? Randomness, a central assumption underlying commonly used tests of statistical significance, is rarely attained, and the effects of its absence are rarely acknowledged. Statistical significance does not speak to the probability that the null hypothesis or an alternative hypothesis is true or false, to the probability that a result would be replicated, or to treatment effects, nor is it a valid indicator of the magnitude or the importance of a result. The persistence of statistical significance testing is due to many subtle factors. Journal editors are not to blame, but as publishing gatekeepers they could diminish its dysfunctional use.
Article
This editorial introduces the current issue of the Journal of Educational Psychology. This issue embarks on a year-long celebration through which we recognize 100 years of our parent organization, the American Psychological Association, in the business of psychology. The Journal's contribution to the celebration is the following: In each 1992 issue, we will publish one or more special articles, special in the sense that these articles have been solicited rather than submitted and in the sense that they reflect on selected aspects of the science and status of educational psychology as we approach the 21st century. In addition, each issue will contain a centennial-year surprise—a reprinted article from an earlier issue of the Journal.
Article
In this editorial, the editor explains what one should expect from a journal like that of the Journal of Educational Psychology. Just as the Journal has a long and distinguished tradition, so too does the field of educational psychology. This is something that we plan to bring to your attention in 1992 when the American Psychological Association hosts a year-long celebration of its 100th birthday. Throughout the centennial year, the Journal will feature several prominent researchers' retrospectives, prospectives, and perspectives on topics and issues that are central to the discipline of educational psychology. In the meantime, the Journal is receptive to educational-psychological research stemming from a wide variety of methodological and statistical approaches, assuming that such approaches are applied with care and precision.
Historical origins of statistical testing practices: The treatment of Fisher versus Neyman-Pearson views in textbooks
  • C J Huberty
Huberty, C. J. (1993). Historical origins of statistical testing practices: The treatment of Fisher versus Neyman-Pearson views in textbooks. Journal of Experimental Education, 61, 317-333.