Effect Sizes for Research: A Broad Practical Approach
Abstract
The goal of this book is to inform a broad readership about a variety of measures and estimators of effect sizes for research, their proper applications and interpretations, and their limitations. Its focus is on analyzing post-research results. The book provides an evenhanded account of controversial issues in the field, such as the role of significance testing. Consistent with the trend toward greater use of robust statistical methods, the book pays much attention to the statistical assumptions of the methods and to robust measures of effect size.
... The statements and results of the comparison are summarized in Table 3. To conduct this analysis, we used the Wilcoxon Rank Sum Test [Whitley and Ball, 2002] and the Cliff's Delta (d) measure [Grissom and Kim, 2005] to determine the level of agreement with each statement between the previously defined groups. The Cliff's Delta (d) measure quantifies the strength of the difference between groups, for instance, how strong the difference is between developers with direct experience in data privacy and those with indirect experience for statement [S1]. ...
... Specifically, for RQ1, we evaluated the data distribution before applying correlation tests to mitigate potential biases in the statistical analysis. We employed the Wilcoxon Rank Sum Test [Whitley and Ball, 2002] and Cliff's Delta (d) [Grissom and Kim, 2005] to assess the level of agreement with each statement between the predefined groups. Additionally, we used the conventional p-value threshold of 0.05 to establish statistical significance. ...
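As a rough illustration of this kind of group comparison (not taken from the cited paper), the Python sketch below pairs a Wilcoxon rank-sum / Mann-Whitney U test with Cliff's delta; the group data and variable names are hypothetical.

```python
import numpy as np
from scipy.stats import mannwhitneyu  # Wilcoxon rank-sum and Mann-Whitney U are equivalent tests

def cliffs_delta(x, y):
    """Cliff's delta: P(X > Y) - P(X < Y), estimated over all pairs."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    diffs = x[:, None] - y[None, :]
    return (np.sum(diffs > 0) - np.sum(diffs < 0)) / (len(x) * len(y))

# Hypothetical 1-5 agreement scores for one statement, split by group.
direct_experience = [5, 4, 4, 5, 3, 4, 5, 4]
indirect_experience = [3, 2, 4, 3, 3, 2, 4, 3]

u, p = mannwhitneyu(direct_experience, indirect_experience, alternative="two-sided")
d = cliffs_delta(direct_experience, indirect_experience)
print(f"U = {u:.1f}, p = {p:.4f}, Cliff's delta = {d:.2f}")
```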
Data privacy is an essential principle of information security, aimed at protecting sensitive data from unauthorized access and information leaks. As software systems advance, the volume of personal information also grows exponentially. Therefore, incorporating privacy engineering practices during development is vital to ensure data integrity, confidentiality, and compliance with legal regulations, such as the General Data Protection Regulation (GDPR). However, there is a gap in understanding developers' awareness of data privacy, their perceptions of the implementation of privacy strategies, and the influence of organizational factors on this adoption. Thus, this paper aims to explore the level of awareness among Brazilian developers regarding data privacy and their perceptions of the implementation strategies adopted to ensure data privacy. Additionally, we seek to understand how organizational factors influence the adoption of data privacy practices. To this end, we surveyed 88 Brazilian developers with privacy-related work experience. We used 21 statements grouped into three topics to measure the Brazilian developers' awareness of data privacy in software. Our statistical analysis reveals substantial gaps between groups, e.g., developers with direct vs. indirect data privacy-related work experience. We also reveal that some data privacy strategies, e.g., Encryption, are both widely used and perceived as highly important, while others, such as Turning off data collection, show that ease of use does not necessarily lead to widespread adoption. Finally, we identified that the absence of dedicated privacy teams correlates with a lower perceived priority and less investment in tools, even in organizations that recognize the importance of privacy. Our findings offer insights into how Brazilian developers perceive and implement data privacy practices, emphasizing the critical role organizational culture plays in decision-making regarding privacy. We hope that our findings will contribute to improving privacy practices within the software development community, particularly in contexts similar to Brazil.
... Due to the small sample size, a paired t test with Hedges' correction was used to determine the differences between the pre- and post-SBE EPIQ scores. Hedges' correction provides a better estimate for smaller sample sizes (Grissom & Kim, 2005). To determine associations between experience and SBE roles with perceived competency ratings on the EPIQ, the Mann-Whitney U test was used because there were unequal numbers for those with prior disaster experience. ...
... This analytical method was chosen due to the limited sample size (n = 41). Effect sizes with Hedges' correction (g) and 95% CIs were included because Hedges provides a better estimate when smaller samples are used (Grissom & Kim, 2005). Of the 44 learners included in this study, 41 completed both the pre- and posttest EPIQ and provided answers to the open-ended questions, resulting in a 93% response rate. ...
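For readers who want to reproduce this style of analysis, a minimal Python sketch is shown below; the pre/post scores are hypothetical, and the small-sample correction uses the common approximation J = 1 - 3/(4·df - 1) applied to the paired Cohen's d.

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical pre/post scores for the same participants.
pre  = np.array([3.2, 2.8, 3.5, 4.0, 3.1, 2.9, 3.6, 3.3])
post = np.array([2.4, 2.5, 2.9, 3.1, 2.6, 2.2, 3.0, 2.8])

t_stat, p_value = ttest_rel(pre, post)        # paired t test
diff = pre - post
n = len(diff)
d_z = diff.mean() / diff.std(ddof=1)          # Cohen's d for paired data
j = 1 - 3 / (4 * (n - 1) - 1)                 # Hedges' small-sample correction factor
g = j * d_z
print(f"t({n - 1}) = {t_stat:.2f}, p = {p_value:.4f}, Hedges' g = {g:.2f}")
```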
Background
Nursing education often relegates disaster training to didactic instruction, resulting in nurses who lack the competencies needed for mass-casualty responses.
Method
In this quasiexperimental exploratory study, senior nursing students (n = 44) rated disaster management competency using the Emergency Preparedness Inventory Questionnaire (EPIQ) before and after simulation-based education (SBE) and also provided insight that was analyzed for themes.
Results
Paired samples t tests demonstrated significant differences, with decreased overall familiarity reported post-SBE versus pre-SBE (µ = 1.15, SD = 1.04, CI = 0.82, p < .001). Themes included real-world, communication, empathy, and roles. Consistent with the Dunning-Kruger effect, post-SBE scores decreased from baseline, suggesting SBE enhanced students' ability to evaluate their competency levels. Exploration of themes emphasized gains in empathy and communication.
Conclusion
Providing disaster SBE allows learners to better evaluate their competencies and learn how disasters affect patients and their families. Nurse educators should scaffold disaster management SBE throughout the curriculum to facilitate transition to practice. [J Nurs Educ. 2025;64(4):217–226.]
... We also perform statistical hypothesis tests (Wilcoxon signed-rank test) [55] and Cliff's delta effect size [56] to compare the distributions of the BLEU-4, METEOR, and ROUGE-L of the predictions generated by the different models trained on the filtered training sets with those of the models trained on the full training sets. We use Holm's correction [57] to adjust the p-values for the multiple tests. ...
... We compare the effectiveness of the models trained with the training instances selected with SIDE 0.9 and Random measured in terms of the previously described metrics (i.e., BLEU-4, METEOR, and ROUGE-L). Again, we perform statistical hypothesis tests (Wilcoxon signed-rank test) [55] and compute the Cliff's delta effect size [56] to compare the distributions of BLEU-4, METEOR, and ROUGE-L of the predictions generated by the SIDE 0.9 model and the Random baseline. We use Holm's correction [57] to adjust the p-values for the multiple tests. ...
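A small illustration of this procedure, with made-up per-instance metric scores, might look as follows in Python (scipy for the paired Wilcoxon signed-rank test, statsmodels for Holm's correction); the score values are purely illustrative.

```python
from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

# Hypothetical per-instance scores for two model variants on three metrics.
scores_filtered = {"BLEU-4":  [0.21, 0.34, 0.18, 0.40, 0.27],
                   "METEOR":  [0.30, 0.41, 0.25, 0.44, 0.33],
                   "ROUGE-L": [0.38, 0.52, 0.31, 0.55, 0.40]}
scores_full     = {"BLEU-4":  [0.18, 0.32, 0.19, 0.35, 0.21],
                   "METEOR":  [0.26, 0.40, 0.27, 0.39, 0.25],
                   "ROUGE-L": [0.37, 0.49, 0.33, 0.50, 0.34]}

p_values = []
for metric in scores_filtered:
    # Paired test: each pair refers to the same test instance.
    _, p = wilcoxon(scores_filtered[metric], scores_full[metric])
    p_values.append(p)

# Holm's correction over the three simultaneous tests.
reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method="holm")
for metric, p, pa, r in zip(scores_filtered, p_values, p_adj, reject):
    print(f"{metric}: raw p = {p:.3f}, Holm-adjusted p = {pa:.3f}, significant: {r}")
```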
Automated code summarization is a long-standing goal for code comprehension. This task consists of automatically generating documentation for a given method. Deep Learning (DL)-based approaches have been proven beneficial for various software engineering (SE) tasks, including this one. Most state-of-the-art datasets for code summarization are automatically mined from GitHub and, thus, might contain erroneous or sub-optimal examples. Previous work showed that using a simple rule-based approach for removing noisy instances allows for a tangible reduction of the training set size while not reducing the effectiveness of the trained models. Motivated by this finding, we conjecture that it is possible to further reduce the dataset size by removing instances that contain different issues. In this paper, we explore the extent to which code-comment coherence, a specific quality attribute of code summaries, can be used to optimize code summarization datasets. Specifically, we hypothesize that removing incoherent code-comment pairs might positively impact the effectiveness of the models. To do this, we rely on SIDE, a recently introduced metric for code-summary coherence. We examine multiple selectivity levels of training instances from two state-of-the-art datasets (TL-CodeSum and Funcom) and evaluate the resulting models on three manually curated test sets. The results show that even halving the training set sizes does not significantly affect the model's ability to generate summaries. However, when comparing the most restrictive selection strategy with a simpler one that randomly selects the training instances, we observe that the resulting accuracy of the model also does not change. This result suggests that (i) current datasets contain many irrelevant examples, and (ii) different quality attributes should be explored for optimizing code summarization datasets.
... The FFQ was structured into 11 sections that partly reflect the sequence of foods throughout the day and include foods with similar characteristics: cereals, meat, fish, milk and dairy products, vegetables, legumes, fruits, miscellaneous foods, water and alcoholic beverages, olive oil and other edible fats, coffee/sugar, and salt. In a subsequent step, the FFQ was validated against the dietary records, and the results were reviewed to make necessary changes to the questionnaire [109]. The initial FFQ consisted of 85 food items and included questions regarding fat consumption. ...
... Differences in the prevalence of exposure groups, i.e., physical frailty and other categorical variables, and their 95% CIs were calculated and used to assess significant differences in the magnitude of the association, i.e., effect size (ES). Differences between continuous variables were assessed using Wilcoxon's effect size, with confidence intervals around them calculated following a non-parametric approach [109]. Differences between categorical variables were assessed using differences in prevalence and confidence intervals around them. ...
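As an illustration of a difference-in-prevalence estimate with a confidence interval, the sketch below uses hypothetical counts and a simple normal-approximation (Wald) interval; the cited study's exact CI procedure may differ.

```python
import math

def prevalence_diff_ci(x1, n1, x2, n2, z=1.96):
    """Difference in prevalence (p1 - p2) with a Wald 95% CI."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, diff - z * se, diff + z * se

# Hypothetical counts: exposed participants in the frail vs. non-frail groups.
diff, lo, hi = prevalence_diff_ci(x1=42, n1=120, x2=25, n2=130)
print(f"difference = {diff:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```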
Artificial Intelligence (AI) is increasingly recognized as a transformative force in healthcare, offering unprecedented opportunities to enhance disease diagnosis, management, and prevention. This PhD thesis is rooted in two fundamental research areas: the application of AI to health and epidemiological data for the purposes of disease prevention and monitoring, and the utilization of AI techniques for the analysis of bioelectrical signals to support clinical decision-making.
The first research area delves into the sophisticated analysis of extensive health and epidemiological datasets using cutting-edge machine learning (ML) methodologies. The objective is to uncover significant patterns that can inform and improve the prevention and management of chronic diseases. By identifying these patterns, the research enables the creation of personalized intervention strategies tailored to individual patient profiles, while also optimizing disease management on a broader, population-wide scale. This approach not only contributes to the advancement of public health but also sets the stage for more proactive healthcare practices.
The second research focus of this thesis explores the development and application of advanced ML and deep learning (DL) models for the interpretation of bioelectrical signals, such as electroencephalograms (EEG), electrocardiograms (ECG), and electromyograms (EMG). It is important to point out that non-invasive technologies such as brain-computer interfaces (BCIs) were used for the analysis of EEG signals.
The AI-driven models developed in this PhD thesis aim to enhance the accuracy and reliability of medical diagnostics, facilitating more precise and personalized clinical decisions. The integration of these models into clinical workflows has the potential to revolutionize patient care by providing healthcare professionals with powerful tools for diagnosis and treatment planning.
The practical outcomes of this research are profound, offering novel tools and frameworks that bridge the gap between AI innovation and clinical application. By incorporating explainable Artificial Intelligence (XAI) principles, the models developed in this thesis are designed to be transparent and interpretable, ensuring that healthcare professionals can trust and effectively use these advanced technologies in their daily practice.
In summary, this PhD thesis makes significant contributions to the intersection of AI and medicine, addressing key challenges in the interpretation of health and epidemiological data as well as the analysis of bioelectrical signals. The findings presented here lay a robust foundation for future advancements in personalized medicine and public health, ultimately aiming to improve patient outcomes and the overall efficacy of healthcare systems.
All contributions made in this thesis are detailed in the respective chapters, providing a comprehensive overview of the research conducted and its impact on the field of AI in healthcare.
... This is often done using effect sizes, which quantify the strength of an observed effect. Several effect sizes have been proposed, with the choice depending on whether the variables are qualitative or quantitative (Fernández-Castilla et al. 2024;Grissom and Kim 2005), and their interpretation may vary across research fields (Funder and Ozer 2019). ...
Interest in understanding creativity through Programme for International Student Assessment (PISA) data is on the rise, yet researchers face methodological challenges in synthesizing findings across various constructs, measures, and datasets. Meta‐analysis—a valuable methodology for synthesizing quantitative data—remains underutilized in creativity research involving large‐scale assessments like PISA. This paper provides guidelines for applying meta‐analytic techniques to PISA creative thinking assessment data to help researchers address these challenges. It introduces meta‐analysis by outlining its definition and advantages, followed by key steps and methodological considerations for synthesizing bivariate and multivariate relationships within PISA. Finally, the paper discusses techniques for managing the computational complexity of meta‐analyzing PISA data. Ultimately, these guidelines aim to support researchers in effectively synthesizing PISA data to advance the study of creativity.
... While the statistical tests allow checking for the presence of significant differences, they do not provide any information about the magnitude of such differences. Therefore, we used the non-parametric Cliff's delta (|d|) effect size [53]. The effect size was considered small for 0.148 ≤ |d| < 0.33, medium for 0.33 ≤ |d| < 0.474, and large for |d| ≥ 0.474. ...
Ensuring equitable access to web-based visual content in Science, Technology, Engineering, and Mathematics (STEM) disciplines remains a significant challenge for visually impaired users. This preliminary study explores the use of Large Language Models (LLMs) to automatically generate high-quality alternative texts for complex web images in these domains, contributing to the development of an accessibility tool. First, we analyzed the outputs of various LLM-based image-captioning systems, selected the most suitable one (Gemini), and developed a browser extension, AlternAtIve, capable of generating alternative descriptions at varying verbosity levels. To evaluate AlternAtIve, we assessed its perceived usefulness in a study involving 35 participants, including a blind user. Additionally, we manually compared the quality of the outputs generated by AlternAtIve with those provided by two state-of-the-practice tools from the Google Web Store, using a custom metric that computes the quality of the descriptions considering their correctness, usefulness, and completeness. The results show that the descriptions generated with AlternAtIve achieved high quality scores, almost always better than those of the other two tools. Although conveying the meaning of complex images to visually impaired users through descriptions remains challenging, the findings suggest that AI-based tools, such as AlternAtIve, can significantly improve the web navigation experience for screen reader users.
... Standardised effect sizes, expressed as Cohen's d or Hedges' g, were used [49]. In this study, Cohen's d was used for the effect size calculation, with the confidence level set at 95%. ...
Background
Nurses are among the most interactive professionals in society, providing both care and psychosocial and psychotherapeutic interventions for individuals with mental health issues. Nurse‐centred approaches have demonstrated a positive impact on addressing mental health issues.
Aim
This study aimed to evaluate the effectiveness of psychosocial and psychotherapeutic interventions administered by nurses in managing mental health issues.
Methods
For this meta‐analysis, data were gathered through a comprehensive search of databases, including PubMed, Web of Science, EBSCOhost, Google Scholar and the YÖK Thesis Center, with no restrictions on publication year. Following the review process, a total of 25 studies were included in the analysis. The PRISMA checklist was employed to guide our study.
Results
The meta‐analysis revealed that interventions implemented by nurses in the context of mental health were effective. Specifically, the psychosocial and psychotherapeutic interventions significantly reduced depression levels in individuals (SMD: −0.918, 95% CI: −1.350 to −0.486; Z = −4.161; p < 0.05) and anxiety levels (SMD: −0.61, 95% CI: −1.190 to −0.029; Z = −2.59; p < 0.05). Additionally, these interventions were associated with an improvement in individuals' quality of life (SMD: 0.673, 95% CI: 0.303–1.403; Z = 3.56; p < 0.05).
Conclusion
Nurses can play a pivotal role in addressing mental health issues, effectively contributing to the reduction of anxiety and depression levels while enhancing the overall quality of life for individuals.
... We report generalised η², in addition to the more commonly provided partial η², as it offers an estimate that is more comparable across within- and between-subject designs (Lakens, 2013; Preacher & Kelley, 2011). In order to offer a more intuitively understandable indication of effect size, we also computed the common language effect size (Grissom & Kim, 2014; McGraw & Wong, 1992). In within-subjects designs, the common language effect size reflects the probability that an individual has a higher score at one assessment point than at the other (Lakens, 2013); in between-subjects designs, it reflects the probability that a randomly sampled participant from one group has a higher value than a randomly sampled participant from the other. ...
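For between-subjects designs, McGraw and Wong's parametric common language effect size can be computed from group summary statistics; the Python sketch below uses hypothetical means and standard deviations and assumes normality, as the original formulation does.

```python
from math import sqrt
from scipy.stats import norm

def common_language_es(mean1, sd1, mean2, sd2):
    """McGraw & Wong's common language effect size for two independent groups:
    the probability that a random draw from group 1 exceeds a random draw from
    group 2, assuming normal distributions."""
    return norm.cdf((mean1 - mean2) / sqrt(sd1 ** 2 + sd2 ** 2))

# Hypothetical group summaries.
cl = common_language_es(mean1=5.1, sd1=1.2, mean2=4.4, sd2=1.0)
print(f"CL = {cl:.2f}")
```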
Objectives
Mindfulness training has been shown to be helpful in the treatment of depression. However, standard mindfulness-based interventions (MBIs) are still not widely available and sustaining a mindfulness practice can be difficult for patients with current depression. Alternative delivery formats may serve to address these problems. This study tested the acceptability, practicality and preliminary efficacy of a blended individual mindfulness-based intervention that supports patients in their practice individually and beyond the duration of standard mindfulness-based interventions.
Method
Thirty-nine patients with persistent depression were entered into the study and supported to practice with either standard length practices (30 min once a day) or shorter, more frequent practices (15 min twice a day) over 12 weeks with minimal therapist support (6 sessions of 30 min duration). Symptoms were assessed at the beginning, and after each third of the intervention. Engagement in practice was monitored throughout and qualitative interviews were conducted post-intervention. Data from 24 service users who waited for depression treatment were collected for benchmarking purposes.
Results
Of those randomised, 24 (62%) completed the intervention. Completers engaged in an average of 89% of formal practices and showed reductions in symptoms with a large effect size, ηp² = 0.66, 90% CI [0.54, 0.73], with 75% of completers moving to symptom levels below the clinical threshold. Thematic analysis of feedback from completers indicated high acceptability but highlighted the need for longer therapist sessions.
Conclusions
Blended individual mindfulness-based interventions have promise for supporting depressed patients to engage in mindfulness practice and reduce symptoms while aligning well with current trends in service delivery. However, adjustments to the current intervention in line with patient suggestions, including more time for individual therapist sessions to bring them closer to the standard length of psychotherapy sessions, are needed to reduce drop-out and underlying practicality problems.
Preregistration
This study was pre-registered on ClinicalTrials.gov (ClinicalTrials.gov ID: NCT04576741).
... A nonparametric counterpart of the CLES [20] is the probability of superiority (PS [28]) estimator, which estimates the same population effect (Peng & Chen, 2014). Because the PS [28] is not derived under normality and equality of variances assumptions like the CLES [20], it can be applied more broadly (Grissom & Kim, 2005). The PS [28] can be computed using the Mann-Whitney U statistic, which indicates how often a randomly selected observation from one sample has a larger value than a randomly selected observation from the other sample. ...
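Because PS is simply the Mann-Whitney U statistic divided by the number of pairwise comparisons, it can be computed directly, as in the sketch below with hypothetical samples (PS and Cliff's delta are related by PS = (delta + 1) / 2).

```python
import numpy as np

def probability_of_superiority(x, y):
    """PS estimator: P(X > Y) + 0.5 * P(X == Y), i.e. U / (n_x * n_y)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    diffs = x[:, None] - y[None, :]
    u = np.sum(diffs > 0) + 0.5 * np.sum(diffs == 0)  # Mann-Whitney U for sample x
    return u / (len(x) * len(y))

# Hypothetical samples from two groups.
group_a = [12, 15, 14, 18, 20, 13]
group_b = [11, 13, 12, 15, 14, 10]
print(f"PS = {probability_of_superiority(group_a, group_b):.2f}")
```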
The prevalence of effect-size (ES) reporting has risen significantly, yet studies comparing two groups tend to rely exclusively on the Cohen’s d family of ESs. In this article, we aim to broaden the readers’ horizon of known ESs by introducing various ES families for two-group comparisons, including indices of standardized differences in central tendency, overlap, dominance, and differences in variability and distributional tails. We describe parametric and nonparametric estimators in each ES family and present an interactive web application (R Shiny) for computing these ESs and facilitating their application. This one-stop calculator allows for the computation of 95 applications of 67 unique ESs and their confidence intervals and various plotting options and provides detailed descriptions for each ES, making it a valuable resource for both self-guided exploration and instructor-led teaching. With this comprehensive guide and its companion app, we aim to improve the clarity and accuracy of ES reporting in research design that involves two-group comparisons.
... We statistically compare the EM predictions generated by the models using again McNemar's test [39] with the OR effect size. As for the CrystalBLEU, we use the paired Wilcoxon signed-rank test [43] and the Cliff's delta [44] effect size. Also in this case we account for multiple tests by adjusting p-values using the Benjamini-Hochberg procedure [40]. ...
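A toy version of this pipeline might look as follows in Python, using statsmodels' McNemar test on a hypothetical 2x2 table of paired exact-match outcomes, an odds-ratio-style effect size from the discordant cells, and Benjamini-Hochberg adjustment over several (partly hypothetical) p-values.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.multitest import multipletests

# Hypothetical paired exact-match (EM) outcomes for two models on the same test set:
# rows = model A correct/incorrect, columns = model B correct/incorrect.
table = np.array([[50, 12],   # A correct:   B correct, B incorrect
                  [ 4, 34]])  # A incorrect: B correct, B incorrect

result = mcnemar(table, exact=True)      # exact binomial McNemar's test
odds_ratio = table[0, 1] / table[1, 0]   # ratio of discordant pairs as an OR-style effect size
print(f"p = {result.pvalue:.4f}, OR = {odds_ratio:.2f}")

# Benjamini-Hochberg adjustment when several such comparisons are run.
p_values = [result.pvalue, 0.030, 0.210]  # the last two p-values are hypothetical
_, p_adj, _, _ = multipletests(p_values, method="fdr_bh")
print("BH-adjusted p-values:", np.round(p_adj, 4))
```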
Deep Learning-based code generators have seen significant advancements in recent years. Tools such as GitHub Copilot are used by thousands of developers with the main promise of a boost in productivity. However, researchers have recently questioned their impact on code quality showing, for example, that code generated by DL-based tools may be affected by security vulnerabilities. Since DL models are trained on large code corpora, one may conjecture that low-quality code they output is the result of low-quality code they have seen during training. However, there is very little empirical evidence documenting this phenomenon. Indeed, most of previous work look at the frequency with which commercial code generators recommend low-quality code without the possibility of relating this to their training set. We investigate the extent to which low-quality code instances seen during training affect the quality of the code generated at inference time. We start by fine-tuning a pre-trained DL model on a large-scale dataset being representative of those usually adopted in the training of code generators. We show that 4.98% of functions in this dataset exhibit one or more quality issues related to security, maintainability, best practices, etc. We use the fine-tuned model to generate 551k Python functions, showing that 5.85% of them are affected by at least one quality issue. We then remove from the training set the low-quality functions, and use the cleaned dataset to fine-tune a second model which has been used to generate the same 551k Python functions. We show that the model trained on the cleaned dataset exhibits similar performance in terms of functional correctness as compared to the original model while, however, generating a statistically significant lower number of low-quality functions (2.16%). Our study empirically documents the importance of high-quality training data for code generators.
... Gemini also achieved a substantial improvement of 441% = (21.1% − 3.9%) / 3.9%. We performed the Wilcoxon signed-rank test (Wilcoxon 1945) and used Cliff's Delta (d) as the effect size (Grissom and Kim 2005) to validate whether explicitly specifying expected refactoring types significantly improved the success rate. The test results (p-value = 3.06E−15, Cliff's |d| = 0.37) confirmed that the improvement of GPT was statistically significant. ...
Software refactoring is an essential activity for improving the readability, maintainability, and reusability of software projects. To this end, a large number of automated or semi-automated approaches/tools have been proposed to locate poorly designed code, recommend refactoring solutions, and conduct specified refactorings. However, even equipped with such tools, it remains challenging for developers to decide where and what kind of refactorings should be applied. Recent advances in deep learning techniques, especially in large language models (LLMs), make it potentially feasible to automatically refactor source code with LLMs. However, it remains unclear how well LLMs perform compared to human experts in conducting refactorings automatically and accurately. To fill this gap, in this paper, we conduct an empirical study to investigate the potential of LLMs in automated software refactoring, focusing on the identification of refactoring opportunities and the recommendation of refactoring solutions. We first construct a high-quality refactoring dataset comprising 180 real-world refactorings from 20 projects, and conduct the empirical study on the dataset. With the to-be-refactored Java documents as input, ChatGPT and Gemini identified only 28 and 7 respectively out of the 180 refactoring opportunities. The evaluation results suggested that the performance of LLMs in identifying refactoring opportunities is generally low and remains an open problem. However, explaining the expected refactoring subcategories and narrowing the search space in the prompts substantially increased the success rate of ChatGPT from 15.6 to 86.7%. Concerning the recommendation of refactoring solutions, ChatGPT recommended 176 refactoring solutions for the 180 refactorings, and 63.6% of the recommended solutions were comparable to (even better than) those constructed by human experts. However, 13 out of the 176 solutions suggested by ChatGPT and 9 out of the 137 solutions suggested by Gemini were unsafe in that they either changed the functionality of the source code or introduced syntax errors, which indicate the risk of LLM-based refactoring.
... Finally, we assessed whether there are statistically significant differences in performance between fully finetuned models and QLoRA-optimized models. To this extent, we employed the Wilcoxon signed-rank test [77], and measured the effect size using Cliff's Delta (d) [78]. The effect sizes are categorized as follows: negligible if |d| < 0.10, small if 0.10 ≤ |d| < 0.33, medium if 0.33 ≤ |d| < 0.474, and large if |d| ≥ 0.474. ...
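These thresholds translate directly into a small labelling helper; a sketch:

```python
def interpret_cliffs_delta(d):
    """Map |d| to the magnitude labels used above."""
    d = abs(d)
    if d < 0.10:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"

for d in (0.05, 0.2, -0.4, 0.6):
    print(d, "->", interpret_cliffs_delta(d))
```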
Code Language Models (CLMs) have demonstrated high effectiveness in automating software engineering tasks such as bug fixing, code generation, and code documentation. This progress has been driven by the scaling of large models, ranging from millions to trillions of parameters (e.g., GPT-4). However, as models grow in scale, sustainability concerns emerge, as they are extremely resource-intensive, highlighting the need for efficient, environmentally conscious solutions. GreenAI techniques, such as QLoRA (Quantized Low-Rank Adaptation), offer a promising path for dealing with large models' sustainability as they enable resource-efficient model fine-tuning. Previous research has shown the effectiveness of QLoRA in code-related tasks, particularly those involving natural language inputs and code as the target output (NL-to-Code), such as code generation. However, no studies have explored its application to tasks that are fundamentally similar to NL-to-Code (natural language to code) but operate in the opposite direction, such as code summarization. This leaves a gap in understanding how well QLoRA can generalize to Code-to-NL tasks, which are equally important for supporting developers in understanding and maintaining code. To address this gap, we investigate the extent to which QLoRA's capabilities in NL-to-Code tasks can be leveraged and transferred to code summarization, one representative Code-to-NL task. Our study evaluates two state-of-the-art CLMs (CodeLlama and DeepSeek-Coder) across two programming languages: Python and Java. Our research tasked models with generating descriptions for Python and Java code methods. The results align with prior findings on QLoRA for source code generation, showing that QLoRA enables efficient fine-tuning of CLMs for code summarization.
... The test is paired since the compared distributions, with versus without normalization, are related to the identical objects (i.e., the scores of the same ten classifiers on the same 71 datasets). We also use the Cliff's delta (paired) to analyze the effect size [153]. In order to interpret the Cliff's delta (paired) effect size, we used the following standard interpretation [392]: ...
Context. Developing secure and reliable software remains a key challenge in software engineering (SE). The ever-evolving technological landscape offers both opportunities and threats, creating a dynamic space where chaos and order compete. Secure software engineering (SSE) must continuously address vulnerabilities that endanger software systems and carry broader socio-economic risks, such as compromising critical national infrastructure and causing significant financial losses. Researchers and practitioners have explored methodologies like Static Application Security Testing Tools (SASTTs) and artificial intelligence (AI) approaches, including machine learning (ML) and large language models (LLMs), to detect and mitigate these vulnerabilities. Each method has unique strengths and limitations. Aim. This thesis seeks to bring order to the chaos in SSE by addressing domain-specific differences that impact AI accuracy. Methodology. The research employs a mix of empirical strategies, such as evaluating effort-aware metrics, analyzing SASTTs, conducting method-level analysis, and leveraging evidence-based techniques like systematic dataset reviews. These approaches help characterize vulnerability prediction datasets. Results. Key findings include limitations in static analysis tools for identifying vulnerabilities, gaps in SASTT coverage of vulnerability types, weak relationships among vulnerability severity scores, improved defect prediction accuracy using just-in-time modeling, and threats posed by untouched methods. Conclusions. This thesis highlights the complexity of SSE and the importance of contextual knowledge in improving AI-driven vulnerability and defect prediction. The comprehensive analysis advances effective prediction models, benefiting both researchers and practitioners.
... Effect size measures are needed to analyze this. For a nonparametric effect size measure, we use Vargha and Delaney's A12 [30, 31]. A12 measures the probability that running one algorithm yields higher values than running another algorithm. If the two algorithms are equivalent, then A12 will be 0.5. ...
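A compact way to compute A12 is the rank-sum formulation given by Vargha and Delaney; the sketch below uses hypothetical per-run values for two algorithms.

```python
import numpy as np
from scipy.stats import rankdata

def vargha_delaney_a12(algo_a, algo_b):
    """Vargha and Delaney's A12: probability that a run of algorithm A yields a
    higher value than a run of algorithm B (0.5 means no difference)."""
    m, n = len(algo_a), len(algo_b)
    ranks = rankdata(np.concatenate([algo_a, algo_b]))  # midranks handle ties
    r1 = ranks[:m].sum()                                # rank sum of algorithm A
    return (r1 / m - (m + 1) / 2) / n

# Hypothetical fitness values from repeated runs of two algorithms.
runs_a = [0.81, 0.78, 0.84, 0.80, 0.83]
runs_b = [0.74, 0.79, 0.72, 0.77, 0.75]
print(f"A12 = {vargha_delaney_a12(runs_a, runs_b):.2f}")
```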
The development and maintenance of video games present unique challenges that differentiate them from Classic Software Engineering (CSE) such as the increased difficulty in locating bugs within video games. This distinction has given rise to Game Software Engineering (GSE), a subfield that intersects software engineering and video games. Our work proposes a novel way for bug localization in video games by evolving simulations via an evolutionary algorithm, which helps to explore the large number of possible simulations. Simulations generate data (i.e., traces) from the behavior of non-player characters (NPCs). NPCs are not controlled by the player and are key components of video games. We hypothesize that such traces can be instrumental in locating bugs. Our approach automatically locates potential buggy model elements from traces. Furthermore, we propose a novel way of applying genetic operations to evolve simulations by selectively combining their components, rather than combining all components as a whole. We evaluate our approach in the commercial video game Kromaia, and the results indicate that evolving simulations using our novel component-specific genetic operations boosts bug localization. Specifically, our approach improved the F-measure for all bug categories over randomly combining all components, the baseline (which focuses on CSE and utilizes bug reports), and Random Search by 7.93%, 27.17%, and 46.34%, respectively. This work opens a new research direction for further exploration in bug localization within GSE and potentially in CSE as well. Moreover, it encourages other researchers to explore alternative genetic operations rather than selecting them by default.
... The test is paired since the compared distributions, with versus without normalization, are related to the identical objects (i.e., the scores of the same ten classifiers on the same 71 datasets). We also use the Cliff's delta (paired) to analyze the effect size [153]. In order to interpret the Cliff's delta (paired) effect size, we used the following standard interpretation [392]: ...
Context: Developing secure and reliable software is an enduring challenge in software engineering (SE). The current evolving landscape of technology brings myriad opportunities and threats, creating a dynamic environment where chaos and order vie for dominance. Secure software engineering (SSE) faces the continuous challenge of addressing vulnerabilities that threaten the security of software systems and have broader socio-economic implications, as they can endanger critical national infrastructure and cause significant financial losses. Researchers and practitioners investigated methodologies such as Static Application Security Testing Tools (SASTTs) and artificial intelligence (AI) such as machine learning (ML) and large language models (LLM) to identify and mitigate these vulnerabilities, each possessing unique advantages and limitations.
Aim: In this thesis, we aim to bring order to the chaos caused by the haphazard usage of AI in SSE contexts without considering the differences that specific domains hold and that can impact the accuracy of AI.
Methodology: Our methodology features a mix of empirical strategies to evaluate effort-aware metrics, analysis of SASTTs, method-level analysis, and evidence-based strategies, such as systematic dataset review, to characterize vulnerability prediction datasets.
Results: Our main results include insights into the limitations of current static analysis tools in identifying software vulnerabilities effectively, such as the identification of gaps in the coverage of SASTTs regarding vulnerability types, the scarce relationship among vulnerability severity scores, an increase in defect prediction accuracy by leveraging just-in-time modeling, and the threats of untouched methods.
Conclusions: In conclusion, this thesis highlights the complexity of SSE and the potential of in-depth context knowledge in enhancing the accuracy of AI in vulnerability and defect prediction methodologies. Our comprehensive analysis contributes to the adoption and research on the effectiveness of prediction models, benefiting practitioners and researchers.
... To show the effect size of the difference, we used Cliff's delta. Following the guidelines of previous work[16,28,36], we interpreted the effect size as small for 0.147 < d < 0.33, medium for 0.33 ≤ d < 0.474, and large for d ≥ 0.474. ...
Open source is experiencing a renaissance period, due to the appearance of modern platforms and workflows for developing and maintaining public code. As a result, developers are creating open source software at speeds never seen before. Consequently, these projects are also facing unprecedented mortality rates. To better understand the reasons for the failure of modern open source projects, this paper describes the results of a survey with the maintainers of 104 popular GitHub systems that have been deprecated. We provide a set of nine reasons for the failure of these open source projects. We also show that some maintenance practices -- specifically the adoption of contributing guidelines and continuous integration -- have an important association with a project failure or success. Finally, we discuss and reveal the principal strategies developers have tried to overcome the failure of the studied projects.
... Results are declared as statistically significant at a 0.05 significance level. We also estimate the magnitude of the observed differences using the Cliff's Delta (d), which allows for a nonparametric effect size measure for ordinal data [105]. ...
It is common practice for developers of user-facing software to transform a mock-up of a graphical user interface (GUI) into code. This process takes place both at an application's inception and in an evolutionary context as GUI changes keep pace with evolving features. Unfortunately, this practice is challenging and time-consuming. In this paper, we present an approach that automates this process by enabling accurate prototyping of GUIs via three tasks: detection, classification, and assembly. First, logical components of a GUI are detected from a mock-up artifact using either computer vision techniques or mock-up metadata. Then, software repository mining, automated dynamic analysis, and deep convolutional neural networks are utilized to accurately classify GUI-components into domain-specific types (e.g., toggle-button). Finally, a data-driven, K-nearest-neighbors algorithm generates a suitable hierarchical GUI structure from which a prototype application can be automatically assembled. We implemented this approach for Android in a system called ReDraw. Our evaluation illustrates that ReDraw achieves an average GUI-component classification accuracy of 91% and assembles prototype applications that closely mirror target mock-ups in terms of visual affinity while exhibiting reasonable code structure. Interviews with industrial practitioners illustrate ReDraw's potential to improve real development workflows.
... The output of the structural model from the SEM-PLS was then evaluated using these criteria: (1) R-squared (or adjusted R-squared): R² values of 0.70, 0.45, and 0.25 indicate strong, moderate, and weak, respectively; (2) effect size (f²): 0.02, 0.15, and 0.35 (small, medium, large); (3) Q² predictive relevance: Q² > 0 indicates the model has predictive relevance, whereas Q² < 0 indicates the model lacks predictive relevance; (4) significance: p-value thresholds of 0.10, 0.05, and 0.01 (Grissom & Kim, 2005; Henseler et al., 2009). ...
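The f² effect size in criterion (2) is typically derived from the change in R² when a given predictor is included versus excluded; a minimal sketch with hypothetical R² values, using the 0.02 / 0.15 / 0.35 benchmarks as rough bin edges:

```python
def cohens_f2(r2_included, r2_excluded):
    """Cohen's f2 for one predictor: (R2_incl - R2_excl) / (1 - R2_incl)."""
    return (r2_included - r2_excluded) / (1 - r2_included)

# Hypothetical R2 of the structural model with and without the predictor of interest.
f2 = cohens_f2(r2_included=0.48, r2_excluded=0.41)
if f2 < 0.02:
    label = "negligible"
elif f2 < 0.15:
    label = "small"
elif f2 < 0.35:
    label = "medium"
else:
    label = "large"
print(f"f2 = {f2:.3f} ({label})")
```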
This study aims to examine the influence of live streaming on social media, personal branding, and the political communication of presidential candidates on Gen Z involvement in the 2024 elections of the Republic of Indonesia. The research uses a post-positivist worldview with a quantitative descriptive approach. This research involved 131 respondents from Gen Z who were selected purposively. The research data were collected using questionnaires distributed through WhatsApp, Facebook, and Instagram. The research data were analyzed with structural equation model (SEM) statistics with the help of WarpPLS 8.0 software. The results of this study indicate that live streaming on social media, personal branding, and political communication have a significant effect on Gen Z involvement in the 2024 presidential election of the Republic of Indonesia. Thus, it can be concluded that Gen Z's involvement in elections in Indonesia is driven by exposure to political communication through live streaming on social media and by candidates' personal branding, which attracts the interest of the younger generation. Therefore, one approach for a presidential candidate to gain Gen Z's sympathy is to use social media optimally. This research still has limitations, as it only involves student respondents within a small scope, so future researchers can extend it to a broader research context with more respondents.
... Given that the effect of the intervention was examined in this study by comparing the standardized difference between the intervention and control groups, the standardized mean difference Hedges' g was utilized as an effect size to test the intervention's effectiveness. Hedges' g is the standardized mean difference between the two group means and provides a more accurate estimate of the effect size than Cohen's d (Grissom and Kim, 2005). Due to the potential discrepancies between the many studies included in this meta-analysis, a random effects model was utilized in this study. ...
The rapid expansion of the Internet and social media has intensified the spread of health misinformation, posing significant risks, especially for older adults. This meta-analysis synthesizes evidence on the prevalence and interventions of health misinformation among older adults. Our findings reveal a high prevalence rate of 47% (95% CI [33%, 60%]), surpassing recent estimates. Offline research settings have a higher prevalence of health misinformation. Despite methodological variances, the prevalence remains consistent across different measures and development levels. Interventions show significant effectiveness (Hedges’ g = 0.76, 95% CI [0.25, 1.26]), with graphic-based approaches outperforming video-based ones. These results underscore the urgent need for tailored, large-scale interventions to mitigate the adverse impacts of health misinformation on older adults. Further research should focus on refining intervention strategies and extending studies to underrepresented regions and populations.
... The difference between the severity of the injected issues and the additional ones identified either manually or automatically is statistically significant (p-value < 0.01) with a medium effect size (Mann-Whitney test [41] and the Cliff's delta [42]). An example of an automatically identified issue classified as low severity by both developers is: "This method provides an interesting feature by [. . . ...
Several techniques have been proposed to automate code review. Early support consisted in recommending the most suited reviewer for a given change or in prioritizing the review tasks. With the advent of deep learning in software engineering, the level of automation has been pushed to new heights, with approaches able to provide feedback on source code in natural language as a human reviewer would do. Also, recent work documented open source projects adopting Large Language Models (LLMs) as co-reviewers. Although the research in this field is very active, little is known about the actual impact of including automatically generated code reviews in the code review process. While there are many aspects worth investigating, in this work we focus on three of them: (i) review quality, i.e., the reviewer's ability to identify issues in the code; (ii) review cost, i.e., the time spent reviewing the code; and (iii) reviewer's confidence, i.e., how confident is the reviewer about the provided feedback. We run a controlled experiment with 29 experts who reviewed different programs with/without the support of an automatically generated code review. During the experiment we monitored the reviewers' activities, for over 50 hours of recorded code reviews. We show that reviewers consider valid most of the issues automatically identified by the LLM and that the availability of an automated review as a starting point strongly influences their behavior: Reviewers tend to focus on the code locations indicated by the LLM rather than searching for additional issues in other parts of the code. The reviewers who started from an automated review identified a higher number of low-severity issues while, however, not identifying more high-severity issues as compared to a completely manual process. Finally, the automated support did not result in saved time and did not increase the reviewers' confidence.
... This study used Hedges' g as the effect size measure. Compared to Cohen's d, Hedges' g provides a more accurate estimation, especially for small sample sizes (Grissom & Kim, 2005). Most of the included studies calculated the effect size based on means, standard deviations, and sample sizes. ...
This three-level meta-analysis investigates irony comprehension differences between autism spectrum disorder (ASD) and typically developing (TD) individuals. A comprehensive analysis of 29 articles reveals that individuals with ASD have lower irony comprehension ability than TD individuals (Hedge's g = -0.55). Furthermore, cultural background and matching strategy were identified as significant moderating variables, while age, language type, and task properties were not. Limitations include the imbalance in the number of studies across subgroups and an inability to analyze important cognitive factors. The findings highlight the need for further empirical research into cross-cultural differences, matching strategies, task properties, and cognitive factors affecting irony comprehension in individuals with ASD.
... Next, we transformed the extracted effect sizes into Hedges' g. Hedges' g was chosen as it was shown to provide a more accurate estimate of the standardized mean difference than Cohen's d, as the latter tends to overestimate the effect size in small samples (Grissom & Kim, 2005). The effects have been coded in a way that a higher Hedges' g value indicates that the effect of financial-scarcity-related stimuli on cognitive performance was more severely affected by household income. ...
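The conversion from an extracted Cohen's d to Hedges' g is a one-line correction; a sketch with hypothetical inputs, using the common approximation J = 1 - 3 / (4(n1 + n2) - 9):

```python
def hedges_g_from_d(d, n1, n2):
    """Apply the small-sample correction J to convert Cohen's d into Hedges' g."""
    j = 1 - 3 / (4 * (n1 + n2) - 9)  # usual approximation of the exact correction factor
    return j * d

# Hypothetical extracted effect: d = 0.40 from a study with 15 and 14 participants.
print(f"g = {hedges_g_from_d(0.40, 15, 14):.3f}")
```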
Prior research suggested that financial-scarcity-related cues disproportionately impede the cognitive performance of the poor, but later studies questioned the extent and even the existence of this effect. In the present paper, we conducted a systematic review and a Bayesian meta-analysis with the aim to resolve the inconsistencies in the literature and to predict when and to what extent the effect appears. Based on 14 effect sizes from 10 studies, our results provide moderate evidence against the existence of the effect. This finding is not robust against the choice of prior distributions. If the effect does exist, its overall size is relatively small (g = 0.09 [-0.03, 0.21], τ = 0.16 [0.09, 0.29]). We did not find evidence for or against the presence of publication bias. We also found that the study designs of the identified studies were homogeneous, and the potential moderators were often not measured or reported limiting the generalisability of the prior findings in the literature. In sum, our main conclusion is that the evidence available in the literature is extremely limited, and it is not possible to make any strong inference. Finally, we provide recommendations for future research on the topic to overcome the shortcomings of the prevalent practices.
... Hedges' g was used instead of Cohen's d since it provided a better estimate in a small sample. 30 Statistical significance was set at α = 0.05 (two-tailed). ...
Purpose
Early intervention after trauma is needed for reduction in clinical distress and prevention of chronic posttraumatic stress disorder (PTSD). This study describes findings from an open pilot trial of a brief stabilization psychotherapy based on imagery techniques for adults with acute PTSD (i.e., within 3 months of onset).
Materials and Methods
Four sessions of 60-minute individual psychotherapy were conducted on 18 participants with PTSD within 3 months after accidents, 15 of whom completed the treatment. The clinician-administered PTSD scale for Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5), the Hamilton Depression and Anxiety Rating Scales, and self-questionnaires were administered at pre-treatment, post-treatment, and 6-month follow-up.
Results
Eight (53.3%) of the 15 patients at post-treatment and 8 of the 9 patients at 6-month follow-up did not meet the DSM-5 criteria for PTSD. Reliable change of PTSD symptoms after treatment was observed in 6 of 15 (45.0%) patients at post-treatment and in 4 of 9 (45.0%) patients after 6 months. There was a significant decrease in PTSD, depression, anxiety, and impaired quality of life scores after treatment, and these gains were maintained after 6 months. No cases of exacerbated PTSD symptoms were observed among completers and non-completers.
Conclusion
Our findings suggest that brief stabilization sessions are safe treatment options for acute PTSD (KCT0001918).
... This result shows that publication bias is not statistically significant. Standardized effect sizes expressed as Hedges' g or Cohen's d are used to calculate the effect size (Grissom & Kim, 2005). In the analysis, the standardized mean difference, i.e., the Cohen's d and Hedges' g coefficients, was used to compare effect size values. ...
Purpose: This study is a meta-analysis aimed at evaluating ergonomic issues in workplaces in terms of occupational health and safety practices. Materials and Methods: This study is of a meta-analytic nature. Data used in the meta-analysis were obtained through searches of databases such as Google Scholar, YÖK Thesis, EBSCO, and Web of Science, without any time restrictions, between January and March 2024. A total of 15 research studies were included from the search results. The data obtained were synthesized using narrative synthesis and meta-analysis methods. The total sample size of the studies used in this study is 65,160. Results: In this meta-analysis study, it was found that the practices implemented for addressing ergonomic issues faced by employees in the workplace were effective (SMD: 0.367, 95% CI: 0.055-0.679; Z = 2.305, p = 0.021, I² = 97.761%, Q = 625.169). The variance among studies is statistically significant according to the analysis results (p < 0.05). It is believed that challenges related to the structure of work in workplaces lead to various discomforts in employees, and therefore, ergonomic practices can minimize physical strains on employees, contribute to work efficiency, and enhance employee health and safety.
... Cliff's Effect-size Measure: To measure the size of the difference in consistency, we use Cliff's delta effect size measure. According to the guidelines in [51], this difference (d) can be small, medium, or large. Specifically, if d < 0.33, then this difference is small. ...
Recently, there has been a noticeable growth in textual content generated through advanced language models, such as chatGPT, across various social networks. ChatGPT can produce content that closely emulates human writing, making it indistinguishable from human content and introducing concerns regarding its potential exploitation by social bots for malicious purposes. This study undertakes a comprehensive investigation leveraging stylometric features to assess and identify bot accounts and chatGPT writing style on the Twitter platform. In particular, we extract stylometric features from bot- and human-written tweets, perform statistical tests, and evaluate the performance of machine-learning models fed by stylistic indicators. Our findings indicate that chatGPT-driven accounts are statistically different from human accounts based on consistency in their writing style, while the experimented models achieve an accuracy of up to 96% and 91% in the detection of chatGPT-based bot accounts and chatGPT-generated tweets, respectively. Finally, we assess the detection performance when adversarial text is introduced in test samples, demonstrating the robustness of the stylometry-based approach under adversarial attacks.
... A retrospective effect size determination and power analysis (two-group independent sample t-tests) was conducted for the fish groups used in the mixed effects models to investigate if the similarity of groups was due to the low sample size (i.e., type II error) or truly reflected a lack of variation in scale composition ratios. Effect size was measured using the Hedges g estimate, which is appropriate when dealing with sample sizes less than 20 (Grissom & Kim, 2005 ...
Fish scale microchemistry can be used to make life‐history inferences, although ecological studies examining scale composition are relatively rare. Salmon scales have an external layer of calcium phosphate hydroxyl apatite (HAP). The structure, hardness, and calcium content of this layer have been shown to vary within and between species. This variation may lead to misinterpretation of trace element profiles. This study uses backscatter scanning electron microscopy with electron dispersive spectrometry to compare scales from salmon populations and to present a more detailed analysis of scale HAP than was previously available. Our findings extend the range of salmon populations for which HAP Ca is available and confirm previous findings that the HAP Ca is relatively invariable within this species.
... To quantitatively assess the study's primary outcome, which is the change in the scores of daily bimanual task performance, we used Hedges' g with a 95% confidence interval (CI) because it provided a more accurate estimation in the case of a small sample size. 41 According to Hedges, g values of 0.2, 0.5, and 0.8 represent small, medium, and large effect sizes respectively. 42 Additionally, this study evaluated I² and Cochran's Q statistics to examine the degree of heterogeneity among the studies. ...
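As a rough illustration of these heterogeneity statistics (not the cited study's actual data), the sketch below computes Cochran's Q as the weighted sum of squared deviations from the fixed-effect pooled estimate and I² as max(0, (Q - df) / Q), using hypothetical effect sizes and variances.

```python
import numpy as np

def heterogeneity(effects, variances):
    """Cochran's Q and I2 for a set of study effect sizes and their sampling variances."""
    y = np.asarray(effects, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)  # inverse-variance weights
    y_bar = np.sum(w * y) / np.sum(w)             # fixed-effect pooled estimate
    q = np.sum(w * (y - y_bar) ** 2)
    df = len(y) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i2

# Hypothetical Hedges' g values and sampling variances from four studies.
q, i2 = heterogeneity([0.45, 0.20, 0.62, 0.31], [0.04, 0.06, 0.05, 0.03])
print(f"Q = {q:.2f}, I2 = {i2:.1f}%")
```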
Aim
To evaluate the effectiveness of magic‐themed interventions in improving daily bimanual task performance in children with unilateral spastic cerebral palsy (CP) and to elucidate the variability in outcomes.
Method
This systematic literature review searched databases including Embase, MEDLINE, Scopus, Cochrane Central, and CINAHL. Outcome measures selected for the meta‐analysis included the Children's Hand‐use Experience Questionnaire, its three subscales, and the Besta subscale C. The overall efficacy of magic‐themed interventions was analysed using Hedges' g as the summary measure for these outcomes. Subgroup analysis compared the efficacy of different modes of training, and a meta‐regression investigated the impact of training duration.
Results
Analyses of four studies involving 78 children showed magic‐themed training significantly improved bimanual task performance (Hedges' g = 0.327, 95% confidence interval [CI] = 0.107–0.547, p = 0.004), especially in group settings (Hedges' g = 0.435, 95% CI = 0.176–0.693, p = 0.001), compared with non‐significant gains from video interventions (Hedges' g = 0.041, 95% CI = −0.380 to 0.462, p = 0.850). Additionally, training duration positively correlated with performance gains (coefficient = 0.0076 per hour, p = 0.001).
Interpretation
Magic‐themed training, especially through group sessions and extended durations, enhances bimanual skills in children with unilateral spastic CP.
... An SMD higher than 0.5 was categorized as moderate, whereas an SMD higher than 0.8 was categorized as large. 45,46 Furthermore, subgroup analyses were considered when a sufficient number of studies were retrieved and when data were available to investigate the potential effects of EST on ventricular repolarization, as measured by the QT interval, in patients with LQT1, LQT2, and LQT3. ...
Objective
The safest and most effective exercise stress tests (EST) modalities for long QT syndrome (LQTS) are currently unknown. The main objective was to explore the effects of EST on the corrected QT interval (QTc) in patients with LQTS, and to compare the effects of different EST modalities (cycle ergometer vs treadmill).
Data Sources
Systematic searches were performed in September 2022 in accordance with the PRISMA statement through PubMed, Medline, EBM Reviews, Embase, and Web of Science.
Main Results
A total of 1728 patients with LQTS, whether congenital or acquired, without any age restrictions (pediatric age ≤18 years and adult age >19 years), and 2437 control subjects were included in the 49 studies. The QT interval data were available for 15 studies. Our analyses showed that the QT interval prolonged in a similar manner using either a cycle ergometer or a treadmill (standardized mean difference [SMD] = 1.89 [95% CI, 1.07-2.71] vs SMD = 1.46 [95% CI, 0.78-2.14], respectively). Therefore, it seems that either modality may be used to evaluate patients with LQTS.
Conclusions
The methodology for the measurement of the QT interval was very heterogeneous between studies, which inevitably influenced the quality of the analyses. Hence, researchers should proceed with caution when exploring and interpreting data in the field of exercise and LQTS.
... When calculating the effect size, Cohen's d or Hedges' g values are commonly used (Grissom and Kim, 2005). In this meta-analysis, the effect size was calculated using Cohen's d, and the significance level of the analyses was set at 95%. ...
This study was conducted to reveal the effect of psychotherapeutic interventions applied to individuals who have experienced an earthquake disaster. For this meta-analysis, the PubMed, Web of Science, Google Scholar, and YÖK Thesis Center databases were searched in June–September 2022, without any restriction on publication year. After screening, 13 studies were included. The data were synthesized using meta-analysis and narrative methods. The meta-analysis found that psychotherapeutic interventions applied to individuals who have experienced an earthquake disaster are effective (SMD: -1.200, 95% CI: -1.692 to -0.707; Z = -4.776, p = 0.000, I² = 97.116%). It was also determined that the country/continent where the research was conducted and the type of psychotherapeutic intervention used play a role in the effectiveness of the interventions. In addition, cognitive behavioral therapy, psychotherapy, and acupuncture were found to be effective for individuals who have experienced an earthquake disaster. Psychotherapeutic interventions for earthquake survivors produce highly positive effects and improve their mental health. The application of cognitive behavioral therapy, psychotherapy, and acupuncture for these individuals is recommended.
... when dealing with small sample sizes (Grissom & Kim, 2005). In addition, Hedges' g uses pooled, weighted standard deviations, compared to pooled standard deviations (Durlak, 2009). ...
In freshwater ecosystems, consumers can play large roles in nutrient cycling by modifying nutrient availability for autotrophic and heterotrophic microbes. Nutrients released by consumers directly support green food webs based on primary production and brown food webs based on decomposition. While much research has focused on impacts of consumer driven nutrient dynamics on green food webs, less attention has been given to studying the effects of these dynamics on brown food webs.
Freshwater mussels (Bivalvia: Unionidae) can dominate benthic biomass in aquatic systems as they often occur in dense aggregations that create biogeochemical hotspots that can control ecosystem structure and function through nutrient release. However, despite functional similarities as filter‐feeders, mussels exhibit variation in nutrient excretion and tissue stoichiometry due in part to their phylogenetic origin. Here, we conducted a mesocosm experiment to evaluate how communities of three phylogenetically distinct species of mussels individually and collectively influence components of green and brown food webs.
We predicted that the presence of mussels would elicit a positive response in both brown and green food webs by providing nutrients and energy via excretion and biodeposition to autotrophic and heterotrophic microbes. We also predicted that bottom‐up provisioning of nutrients would vary among treatments as a result of stoichiometric differences of species combinations, and that increasing species richness would lead to greater ecosystem functioning through complementarity resulting from greater trait diversity.
Our results show that mussels affect the functioning of green and brown food webs through altering nutrient availability for both autotrophic and heterotrophic microbes. These effects are likely to be driven by phylogenetic constraints on tissue nutrient stoichiometry and consequential excretion stoichiometry, which can have functional effects on ecosystem processes. Our study highlights the importance of measuring multiple functional responses across a gradient of diversity in ecologically similar consumers to gain a more holistic view of aquatic food webs.
... Additionally, the absence of publication bias was confirmed through an examination of the funnel plot (see Fig. 2). Effect sizes were standardized as Cohen's d or Hedges's g [17]. In this investigation, Cohen's d was employed, with statistical significance set at 95%. ...
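As an aside on the funnel-plot check mentioned above, the construction itself is simple: each study's effect size is plotted against its standard error on an inverted axis, and asymmetry around the pooled estimate hints at publication bias. A sketch with made-up study values, assuming Matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt

# hypothetical study-level effect sizes (d) and standard errors
d = np.array([0.42, 0.31, 0.55, 0.18, 0.47, 0.29, 0.36])
se = np.array([0.10, 0.15, 0.22, 0.08, 0.18, 0.12, 0.09])

plt.scatter(d, se)
plt.axvline(np.average(d, weights=1 / se**2), linestyle="--")  # pooled estimate
plt.gca().invert_yaxis()            # most precise studies at the top
plt.xlabel("Effect size (Cohen's d)")
plt.ylabel("Standard error")
plt.title("Funnel plot: symmetry suggests little publication bias")
plt.show()
```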
Objective This systematic review and meta-analysis aimed to assess the effectiveness of nurses' psychosocial interventions for addressing sensory deprivation in intensive care units (ICUs).
Materials and methods A comprehensive search of the PubMed, Web of Science, EBSCOhost, Google Scholar, CINAHL, Embase, Cochrane Library, and YÖK Thesis Center databases was conducted from August 2023 to May 2024, without any temporal restrictions. In addition, a physical search for grey literature was made in the university library.
Results The study revealed that nurses' psychosocial interventions significantly improved patients' level of consciousness (SMD = 1.042, 95% CI = 0.716 to 1.369; Z = 6.25; p < .05) and sleep quality in ICUs (SMD = 1.21, 95% CI = 0.232 to 1.810; Z = 2.49; p < .05). The effectiveness of psychosocial interventions varied based on the type of intervention, patient age, ICU type, patient group, and intervention duration. Notably, auditory stimuli and aromatherapy demonstrated particularly high effect sizes, significantly enhancing patients' levels of consciousness and sleep quality.
Conclusion Psychosocial interventions aimed at reducing sensory deprivation in intensive care units exert beneficial effects on individuals, notably enhancing their level of consciousness and improving sleep quality.
... While the statistical tests allow checking the presence of significant differences, they do not provide any information about the magnitude of such differences. Therefore, we used the nonparametric Cliff's delta (|δ|) effect size [7]. The effect size is considered small for 0.148 ≤ |δ| < 0.33, medium for 0.33 ≤ |δ| < 0.474, and large for |δ| ≥ 0.474. ...
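A minimal sketch of Cliff's delta and the magnitude labels quoted above (assuming NumPy; values below 0.148 are conventionally labelled negligible, which the excerpt leaves implicit):

```python
import numpy as np

def cliffs_delta(x, y):
    """Cliff's delta: (#{x_i > y_j} - #{x_i < y_j}) / (m * n)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    diffs = x[:, None] - y[None, :]              # all pairwise differences
    return (np.sum(diffs > 0) - np.sum(diffs < 0)) / (len(x) * len(y))

def magnitude(delta):
    """Thresholds as quoted in the excerpt above."""
    a = abs(delta)
    if a < 0.148:
        return "negligible"
    if a < 0.33:
        return "small"
    if a < 0.474:
        return "medium"
    return "large"
```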
Autonomous surface vehicles (ASVs) need to complete missions without posing risks to other maritime traffic. Safe traffic is controlled by the International Regulations for Preventing Collisions at Sea (COLREGS) formulated by the International Maritime Organization (IMO). Being designed with human operators in mind, the COLREGS are intentionally underspecified, which may result in ambiguous requirements for correct ASV behaviour. Hence, the systematic testing of such ambiguous situations is particularly important. This paper presents a model-based test generation approach for automatically deriving initial scenes of complex sea encounters involving multiple vessels through a multi-step refinement approach. First, a diverse set of functional scenarios is derived automatically. Then, we provide a mapping from functional scenarios to logical scenarios that capture geometrical constraints between potentially unsafe ship encounters. Finally, initial scenes with precise vessel placements are generated from logical scenarios by search-based algorithms. Our extensive evaluation shows that our approach derives a diverse set of initial scenes with high risk levels within a few minutes, even for six-vessel encounters.
Model-based testing (MBT) has been an important methodology in software engineering, attracting extensive research attention for over four decades. However, despite its academic acclaim, studies examining the impact of MBT in industrial environments—particularly regarding its extended effects—remain limited and yield unclear results. This gap may contribute to the challenges in establishing a study environment for implementing and applying MBT in production settings to evaluate its impact over time. To bridge this gap, we collaborated with an industrial partner to undertake a comprehensive, longitudinal empirical study employing mixed methods. Over two months, we implemented our MBT tool within the corporation, assessing the immediate and extended effectiveness and efficiency of MBT compared to script-writing-based testing. Through a mix of quantitative and qualitative methods—spanning controlled experiments, questionnaire surveys, and interviews—our study uncovers several insightful findings. These include differences in effectiveness and maintainability between immediate and extended MBT application, the evolving perceptions and expectations of engineers regarding MBT, and more. Leveraging these insights, we propose actionable implications for both the academic and industrial communities, aimed at bolstering confidence in MBT adoption and investment for software testing purposes.
Introduction: This study aims to investigate the effects of mathematics education programs designed with differentiation methods for gifted students on students' academic achievement in mathematics and their attitudes toward mathematics. Method: The data were analyzed using meta-analysis. The included studies were published between 2012 and 2023, written in Turkish or English, and used a pretest-posttest experimental design with a control group. According to the defined criteria, 18 studies were included in the meta-analysis for the achievement variable and 8 studies for the attitude variable. Based on the heterogeneity test results, a random-effects model was used to analyze the effect sizes for both the achievement (Q = 218.087, p < .001, I² = 92.205) and attitude (Q = 32.147, p < .001, I² = 78.225) variables (p < .05, I² > 75). Results: The analyses show that the effect of differentiated mathematics education programs for gifted students on mathematics achievement is strong and significant in favor of the experimental group (Hedges' g = 1.235, z = 7.391, p < .001), as is the effect on attitudes toward mathematics (Hedges' g = 0.932, z = 3.477, p = .001). Within the subgroup analyses, results for the moderators of country, grade level, publication type, and type of differentiation are reported in detail. Discussion: The findings show that differentiation methods positively affect gifted students' academic achievement in mathematics and their attitudes toward mathematics. In light of these findings, wider implementation of differentiated instructional approaches is recommended so that gifted students can benefit from these programs to the greatest possible extent.
Large-scale Android apps that provide complex functions are gradually becoming the mainstream in Android app markets. They tend to display many GUI widgets on a single GUI page, which, unfortunately, can cause more redundant test actions—actions with similar functions—to automatic testing approaches. The effectiveness of existing testing approaches is still limited, suggesting the necessity of reducing the test effort on redundant actions. In this paper, we first identify three types of GUI structures that can cause redundant actions and then propose a novel approach, called action equivalence evaluation, to find the actions with similar functions by exploiting both GUI structure and functionality. By integrating this approach with existing testing tools, the test efficacy can be improved.
We conducted experiments on 17 large-scale Android apps, including three industrial apps: Google News, Messenger, and WeChat. The results show that more instructions can be covered and more crashes can be detected compared to the state-of-the-art Android testing tools. In total, 29 real bugs were found in our experiment; moreover, 760 bugs across 40 versions of WeChat were detected in the real test environment during a three-month testing period.
Introduction
The aim is to scrutinize approximate entropy (ApEn) to distinguish the optimal complexity of heart rate variability (HRV) in children diagnosed with attention deficit hyperactivity disorder (ADHD). This was accomplished by varying the embedding dimension m and tolerance r, whose optimal values are determined heuristically. ApEn was applied in ADHD to assess its effects on the chaotic HRV response.
Methods
We studied 56 children divided equally into two groups: ADHD and control. Autonomic modulation of the heart rate was monitored for 20 min in the supine position without any physical, sensory or pharmacological stimuli. ApEn initially had r: 0.1 → 1.0 in 0.1 intervals and m: 1 → 10 in intervals of 1. The statistical significances were measured by three effect sizes: Cohen’s d, Hedges’ g and Glass’s Δ.
Results
The most statistically important results were obtained for r = 0.9334 and m = 1, 2, and 3. Cohen's d (1.1277; m = 2) and Hedges' g (1.1119; m = 2) are the most reliable effect sizes, whereas Glass's Δ (1.3724; m = 1) is less reliable. ROC curve analysis shows AUC > 0.77 for r = 0.9334 and m = 1, 2, and 3.
Conclusion
ApEn recognized the increased chaotic response in ADHD. This was confirmed by the three effect sizes, the AUC, and the p value during ROC analysis. Still, ApEn is an unreliable mathematical marker. ADHD discrimination was only achieved by extending the surveillance ranges for r (0.8 → 1.0) and m (1 → 3) at intervals of 0.0167. This necessitates an ‘a priori’ study, making it inapt for online analysis. Even so, it could be useful in ‘post hoc’ analysis.
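For readers unfamiliar with the statistic itself, a compact sketch of ApEn(m, r) for a one-dimensional series follows (assuming NumPy; this is not the authors' implementation, and r is expressed here on the scale of the data, whereas studies such as this one typically sweep r as a multiple of the series' standard deviation).

```python
import numpy as np

def apen(u, m, r):
    """Approximate entropy ApEn(m, r) of a 1-D series u.
    r is a tolerance on the scale of u, e.g. r = 0.2 * np.std(u).
    Quadratic in the series length, so suitable only for short recordings."""
    u = np.asarray(u, float)
    N = len(u)

    def phi(m):
        # all length-m embedding vectors
        x = np.array([u[i:i + m] for i in range(N - m + 1)])
        # Chebyshev distance between every pair of vectors
        dist = np.max(np.abs(x[:, None, :] - x[None, :, :]), axis=2)
        # fraction of vectors within tolerance r (self-matches included)
        c = np.mean(dist <= r, axis=1)
        return np.mean(np.log(c))

    return phi(m) - phi(m + 1)
```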
Accumulating evidence implicates immune dysregulation and chronic inflammation in neurodevelopmental disorders (NDDs), often manifesting as abnormal alterations in peripheral blood immune cell levels. The mononuclear phagocyte system, including monocytes and microglia, has been increasingly recognized for its involvement in the pathogenesis of NDDs. However, due to inconsistent findings in the literature, whether monocytes can serve as a reliable biomarker for NDDs remains controversial. To address this issue, we conducted a systematic review and meta-analysis of studies examining monocyte counts in NDD individuals. A comprehensive search was conducted across the PubMed, Web of Science, and Scopus databases. Variables extracted for analysis encompassed the authors' names, year of study, sample size, patients' age, type of disease, mean and standard deviation of monocyte counts, and sex ratio. A total of 2503 articles were found by searching the three databases. After removing duplicates and screening titles, abstracts, and full texts, 17 articles met the inclusion criteria, and 20 independent studies were included in the meta-analysis. The results indicated significantly increased monocyte counts in five types of NDDs compared to Typical Development (TD) groups (g = 0.36, 95% CI [0.23, 0.49]). Subgroup analyses revealed no significant differences in monocyte counts across different NDD types, gender, or age. These findings suggest that aberrant alterations in monocyte counts are common in NDD cases, indicating their potential as biomarkers for these conditions. Future research should further investigate the role of monocytes in understanding the mechanisms, early detection, and clinical diagnosis of NDDs.
The use of new machines in production lines due to technological developments makes business ecosystems more complex every day. In parallel with these changes, the diversity and impact level of risks pose serious threats to employees, businesses, and the environment. Ensuring the sustainability of production can be achieved through effective and comprehensive occupational health and safety practices. Risk assessments, checklists, and emergency plans are some of these practices. This study was conducted to reveal the impact levels of practices aimed at improving occupational health and safety. The meta-analysis method was used. The data used in the analysis were obtained by searching the Web of Science, Google Scholar, YÖK (The Council of Higher Education), PubMed, and EBSCOhost databases without any time limitation until 31.01.2024. As a result of the comprehensive search, 20 studies were determined to be suitable for the analysis; these studies were included and synthesized by meta-analysis. The meta-analysis determined that occupational health and safety practices for employees were effective (SMD: 0.924, 95% CI: 0.494 to 1.354, Z = 4.214, p = 0.000, I² = 98.670%, Q = 1428.054). The analysis results revealed that the variance between the studies was statistically significant (p < 0.05). Additionally, occupational health and safety practices were found to enhance employees' sense of security and productivity, reduce workplace accidents and occupational diseases, and make a significant contribution to the development of a safety culture.
Aim This study was conducted to reveal the effectiveness of telemedicine applications in mental health services.
Materials and methods For this meta-analysis study, data were obtained by searching the PubMed, Web of Science, EBSCOhost, Google Scholar, and YÖK Thesis Center databases for the last 5 years in September–December 2023. After the review, 24 studies were included.
Findings In this meta-analysis, it was found that telemedicine interventions in mental health services decreased individuals' depression levels (SMD = −0.168, 95% CI −0.315 to −0.021; Z = −2.243; p < .05). In addition, it was determined that the duration of the intervention played a role in the effectiveness of the telemedicine interventions applied to individuals, whereas the type of intervention, the country where the research was conducted, and the patient group did not.
Conclusion Telemedicine applications in mental health services can play an effective role in reducing the burden of chronic mental illness and improving patient outcomes.
Purpose To investigate the effects of web-based psychotherapeutic interventions on depression among individuals with mood disorders.
Method For this meta-analysis study, data were obtained from October to December 2023 by searching PubMed, Web of Science, EBSCOhost, Google Scholar, and YÖK Thesis Center for articles published in the past 5 years. In the first stage of the search, 12,056 records were obtained. After removing duplicate studies, 4,910 records were considered for title and abstract review. After this evaluation, 139 studies were identified for full-text review. After the review, six studies reporting results on the effectiveness of web-based psychotherapeutic interventions on depression among individuals with mood disorders were ultimately included.
Results Web-based interventions had significant positive effects and provided decreases in depression levels (standardized mean difference = −0.168, 95% confidence interval [−0.315, −0.021]; Z = −2.243; p < 0.05).
Conclusion Web-based interventions for mood disorders may play an effective role in reducing the burden of chronic mental illness and improving patient outcomes.
Recent advances in large language models (LLMs) make it potentially feasible to automatically refactor source code with LLMs. However, it remains unclear how well LLMs perform compared to human experts in conducting refactorings automatically and accurately. To fill this gap, in this paper, we conduct an empirical study to investigate the potential of LLMs in automated software refactoring, focusing on the identification of refactoring opportunities and the recommendation of refactoring solutions. We first construct a high-quality refactoring dataset comprising 180 real-world refactorings from 20 projects, and conduct the empirical study on the dataset. With the to-be-refactored Java documents as input, ChatGPT and Gemini identified only 28 and 7, respectively, of the 180 refactoring opportunities. However, explaining the expected refactoring subcategories and narrowing the search space in the prompts substantially increased the success rate of ChatGPT from 15.6% to 86.7%. Concerning the recommendation of refactoring solutions, ChatGPT recommended 176 refactoring solutions for the 180 refactorings, and 63.6% of the recommended solutions were comparable to (or even better than) those constructed by human experts. However, 13 of the 176 solutions suggested by ChatGPT and 9 of the 137 solutions suggested by Gemini were unsafe in that they either changed the functionality of the source code or introduced syntax errors, indicating the risk of LLM-based refactoring. To this end, we propose a detect-and-reapply tactic, called RefactoringMirror, to avoid such unsafe refactorings. By reapplying the identified refactorings to the original code using thoroughly tested refactoring engines, we can effectively mitigate the risks associated with LLM-based automated refactoring while still leveraging the LLMs' intelligence to obtain valuable refactoring recommendations.
Safety training is crucial to mitigate the risk of damage when a disaster occurs and can play a vital role in enhancing community response. Augmented Reality (AR) is an emerging technology for safety training that holds great pedagogical potential. This study aims to explore the effectiveness of AR training in terms of knowledge acquisition and retention, as well as self-efficacy enhancement. We developed a new video see-through AR training tool on a tablet to teach users about operating a fire extinguisher to put out a fire following the PASS procedure: Pull, Aim, Squeeze, and Sweep. The AR training tool was tested with 60 participants. Test results were systematically compared with findings from the literature investigating Virtual Reality (VR) and video-based safety training. The findings indicate that, directly after the training, AR outperformed traditional video training in terms of knowledge retention, long-term self-efficacy, and quality of instructions. However, the AR experience was not as effective as the VR experience in these areas, although the AR group had a smaller decrease in knowledge over time. These findings suggest that the AR-based training approach offers benefits for long-term memory recall.
Numerous authors suggest that the data gathered by investigators are not normal in shape. Accordingly, methods for assessing pairwise multiple comparisons of means with traditional statistics will frequently result in biased rates of Type I error and depressed power to detect effects. One solution is to obtain a critical value to assess statistical significance through bootstrap methods. The SAS system can be used to conduct step-down bootstrapped tests. The authors investigated this approach when data were neither normal in form nor equal in variability in balanced and unbalanced designs. They found that the step-down bootstrap method resulted in substantially inflated rates of error when variances and group sizes were negatively paired. Based on their results, and those reported elsewhere, the authors recommend that researchers should use trimmed means and Winsorized variances with a heteroscedastic test statistic. When group sizes are equal, the bootstrap procedure effectively controlled Type I error rates.
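The recommendation above, trimmed means and Winsorized variances with a heteroscedastic statistic, is essentially Yuen's test. A sketch with 20% trimming, assuming NumPy and SciPy:

```python
import numpy as np
from scipy import stats

def yuen_test(x, y, trim=0.2):
    """Yuen's heteroscedastic test for trimmed means with Winsorized variances."""
    def parts(a):
        a = np.sort(np.asarray(a, float))
        n = len(a)
        g = int(np.floor(trim * n))
        h = n - 2 * g                            # effective sample size
        tmean = a[g:n - g].mean()                # trimmed mean
        w = a.copy()
        w[:g], w[n - g:] = a[g], a[n - g - 1]    # Winsorize the tails
        d = (n - 1) * w.var(ddof=1) / (h * (h - 1))
        return tmean, d, h

    t1, d1, h1 = parts(x)
    t2, d2, h2 = parts(y)
    t = (t1 - t2) / np.sqrt(d1 + d2)
    df = (d1 + d2) ** 2 / (d1 ** 2 / (h1 - 1) + d2 ** 2 / (h2 - 1))
    return t, df, 2 * stats.t.sf(abs(t), df)     # statistic, df, two-sided p
```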
Consider the problem of performing all pair-wise comparisons among J dependent groups based on measures of location associated with the marginal distributions. It is well known that the standard error of the sample mean can be large relative to other estimators when outliers are common. Two general strategies for addressing this problem are to trim a fixed proportion of observations or empirically check for outliers and remove (or down-weight) any that are found. However, simply applying conventional methods for means to the data that remain results in using the wrong standard error. Methods that address this problem have been proposed, but among the situations considered in published studies, no method has been found that gives good control over the probability of a Type I error when sample sizes are small (less than or equal to thirty); the actual probability of a Type I error can drop well below the nominal level. The paper suggests using a slight generalization of a percentile bootstrap method to address this problem.
Given a random sample from each of two independent groups, this article takes up the problem of estimating power, as well as a power curve, when comparing 20% trimmed means with a percentile bootstrap method. Many methods were considered, but only one was found to be satisfactory in terms of obtaining both a point estimate of power as well as a (one-sided) confidence interval. The method is illustrated with data from a reading study where theory suggests two groups should differ but nonsignificant results were obtained.
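A bare-bones percentile bootstrap for the difference of 20% trimmed means of two independent groups might look like the sketch below (assuming NumPy and SciPy); the papers above build on this basic recipe for dependent groups and for power estimation, which the sketch does not cover.

```python
import numpy as np
from scipy import stats

def boot_trimmed_diff_ci(x, y, trim=0.2, nboot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the difference of trimmed means
    of two independent groups."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    diffs = np.empty(nboot)
    for b in range(nboot):
        xb = rng.choice(x, size=len(x), replace=True)
        yb = rng.choice(y, size=len(y), replace=True)
        diffs[b] = stats.trim_mean(xb, trim) - stats.trim_mean(yb, trim)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```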
In this commentary, we offer a perspective on the problem of authors reporting and interpreting effect sizes in the absence of formal statistical tests of their chanceness. The perspective reinforces our previous distinction between single-study investigations and multiple-study syntheses.
Data analysis methods in psychology still emphasize statistical significance testing, despite numerous articles demonstrating its severe deficiencies. It is now possible to use meta-analysis to show that reliance on significance testing retards the development of cumulative knowledge. But reform of teaching and practice will also require that researchers learn that the benefits that they believe flow from use of significance testing are illusory. Teachers must revamp their courses to bring students to understand that (a) reliance on significance testing retards the growth of cumulative research knowledge; (b) benefits widely believed to flow from significance testing do not in fact exist; and (c) significance testing methods must be replaced with point estimates and confidence intervals in individual studies and with meta-analyses in the integration of multiple studies. This reform is essential to the future progress of cumulative knowledge in psychological research.
Although estimating substantive importance (in the form of reporting effect sizes) has recently received widespread endorsement, its use has not been subjected to the same degree of scrutiny as has statistical hypothesis testing. As such, many researchers do not seem to be aware that certain of the same criticisms launched against the latter can also be aimed at the former. Our purpose here is to highlight major concerns about effect sizes and their estimation. In so doing, we argue that effect size measures per se are not the hoped-for panaceas for interpreting empirical research findings. Further, we contend that if effect sizes were the only basis for interpreting statistical data, social-science research would not be in any better position than it would if statistical hypothesis testing were the only basis. We recommend that hypothesis testing and effect-size estimation be used in tandem to establish a reported outcome's believability and magnitude, respectively, with hypothesis testing (or some other inferential statistical procedure) retained as a "gatekeeper" for determining whether or not effect sizes should be interpreted. Other methods for addressing statistical and substantive significance are advocated, particularly confidence intervals and independent replications.
The advantages that confidence intervals have over null-hypothesis significance testing have been presented on many occasions to researchers in psychology. This article provides a practical introduction to methods of constructing confidence intervals for multiple and partial R² and related parameters in multiple regression models based on “noncentral” F and χ² distributions. Until recently, these techniques have not been widely available due to their neglect in popular statistical textbooks and software. These difficulties are addressed here via freely available SPSS scripts and software and illustrations of their use. The article concludes with discussions of implications for the interpretation of findings in terms of noncentral confidence intervals, alternative measures of effect size, the relationship between noncentral confidence intervals and power analysis, and the design of studies.
Although confidence interval procedures for analysis of variance (ANOVA) have been available for some time, they are not well known and are often difficult to implement with statistical packages. This article discusses procedures for constructing individual and simultaneous confidence intervals on contrasts on parameters of a number of fixed-effects ANOVA models, including multivariate analysis of variance (MANOVA) models for the analysis of repeated measures data. Examples show how these procedures can be implemented with accessible software. Confidence interval inference on parameters of random-effects models is also discussed.
Monte Carlo techniques were used to determine the effect of using common critical values as an approximation for uncommon sample sizes. Results indicate there can be a significant loss in statistical power. Therefore, even though many instructors now rely on computer statistics packages, the recommendation is made to provide more specificity (i.e., values between 30 and 60) in tables of critical values published in textbooks.
Trivials are effect sizes associated with statistically non-significant results. Trivials are like Tribbles in the Star Trek television show. They are cute and loveable. They proliferate without limit. They probably growl at Bayesians. But they are troublesome. This brief report discusses the trouble with trivials.
The question of how much to trim or which weighting constant to use are practical considerations in applying robust methods such as trimmed means (L-estimators) and Huber statistics (M-estimators). An index of location relative efficiency (LRE), which is a ratio of the narrowness of resulting confidence intervals, was applied to various trimmed means and Huber M-estimators calculated on seven representative data sets from applied education and psychology research. On the basis of LREs, lightly trimmed means were found to be more efficient than heavily trimmed means, but Huber M-estimators systematically produced narrower confidence intervals. The weighting constant of ψ = 1.28 was found to be superior to various competitors suggested in the literature for n < 50.
The purpose of this study was to compare the statistical power of a variety of exact tests in the 2 × C ordered categorical contingency table using StatXact software. The Wilcoxon Rank Sum, Expected Normal Scores, Savage Scores (or its Log Rank equivalent), and Permutation tests were studied. Results indicated that the procedures were nearly the same in terms of comparative statistical power.
This paper evaluates the D'Agostino Su test and the Triples test for testing symmetry versus asymmetry. These procedures are evaluated as preliminary tests in the selection of the most appropriate procedure for testing the equality of means with two independent samples under a variety of symmetric and asymmetric sampling situations.
In the continuing debate over the use and utility of effect sizes, more discussion often helps to both clarify and syncretize methodological views. Here, further defense is given of Roberts & Henson (2002) in terms of measuring bias in Cohen's d, and a rejoinder to Sawilowsky (2003) is presented.
The structure of the first invited debate in JMASM is to present a target article (Sawilowsky, 2003), provide an opportunity for a response (Roberts & Henson, 2003), and to follow with independent comments from noted scholars in the field (Knapp, 2003; Levin & Robinson, 2003). In this rejoinder, I provide a correction and a clarification in an effort to bring some closure to the debate. The intention, however, is not to rehash previously made points, even where I disagree with the response of Roberts & Henson (2003).
In the critique that follows, I have attempted to summarize the principal disagreements between Sawilowsky and Roberts & Henson regarding the reporting and interpreting of statistically non-significant effect sizes, and to provide my own personal evaluations of their respective arguments.
Monte Carlo research has demonstrated that there are many applications of the rank transformation that result in an invalid procedure. Examples include the two dependent samples, the factorial analysis of variance, and the factorial analysis of covariance layouts. However, the rank transformation has been shown to be a valid and powerful test in the two independent samples layout. This study demonstrates that the rank transformation is also a robust and powerful alternative to Hotelling's T² test when the data are on a Likert scale.
The aim of this Monte Carlo study is to examine alternatives to estimated variability in building bracketed intervals about the trimmed mean.
Researchers engaged in computer-intensive studies may need exact critical values, especially for sample sizes and alpha levels not normally found in published tables, as well as the ability to control 'best-fit' criteria. They may also benefit from the ability to directly generate these values rather than having to create lookup tables. Fortran 90 programs generate 'best-conservative' (bc) and 'best-fit' (bf) critical values with associated probabilities for the Kolmogorov-Smirnov test of general differences (bc), Rosenbaum's test of location (bc), Tukey's quick test (bc and bf), and the Wilcoxon rank-sum test (bc).
Attempts to attain knowledge as certified true belief have failed to circumvent Hume's injunction against induction. Theories must be viewed as unprovable, improbable, and undisprovable. The empirical basis is fallible, and yet the method of conjectures and refutations is untouched by Hume's insights. The implication for statistical methodology is that the requisite severity of testing is achieved through the use of robust procedures, whose assumptions have not been shown to be substantially violated, to test predesignated range null hypotheses. Nonparametric range null hypothesis tests need to be developed to examine whether or not effect sizes or measures of association, as well as distributional assumptions underlying the tests themselves, meet satisficing criteria.
Consider the comparison of two binomial variates with parameters p1 and p2. The parameter of non-centrality for the Fisher–Irwin exact test is the odds ratio, ψ = p1q2/(p2q1), where qi = 1 − pi, i = 1, 2. Exact confidence limits for ψ may be calculated (Fisher, 1935, 1962; Cornfield, 1956) and several authors have suggested approximate methods. The various approximate methods are reviewed and, from both theoretical and numerical results, we find that Cornfield's approximate method and Method II of Gart (1962) yield the better approximations under various conditions. An F-test alternative to the exact test of 2 × 2 tables (Gart, 1962) is found to be accurate in small numbers.
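For orientation, the sketch below computes ψ for a 2 × 2 table together with the simple Woolf (logit) interval; the Cornfield and Gart approximations reviewed in the paper are more refined, so treat this only as an illustration.

```python
import math

def odds_ratio_woolf_ci(a, b, c, d, z=1.96):
    """Odds ratio psi = (a*d)/(b*c) for the 2x2 table [[a, b], [c, d]]
    (successes/failures by group) and Woolf's logit-based confidence
    interval. Assumes all four cells are non-zero."""
    psi = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(psi) - z * se_log)
    hi = math.exp(math.log(psi) + z * se_log)
    return psi, lo, hi
```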
Normative comparisons are a useful but stringent procedure for evaluating the value of therapeutic interventions. This procedure, consisting of comparing the behavior of treated subjects to that of nondisturbed subjects, is described, and its application to various commonly-used therapy outcome measures is discussed. Examples of the use of normative comparisons from the therapy outcome literature are provided. Statistical problems are considered, and suggested solutions to various potential pitfalls in normative comparisons are described. The need to provide evidence of treatment effectiveness convincing to lay skepticism, wherever possible, is underscored.
Nonparametric procedures are often more powerful than classical tests for real world data which are rarely normally distributed. However, there are difficulties in using these tests. Computational formulas are scattered throughout the literature, and there is a lack of availability of tables and critical values. The computational formulas for twenty commonly employed nonparametric tests that have large-sample approximations for the critical value are brought together. Because there is no generally agreed upon lower limit for the sample size, Monte Carlo methods were used to determine the smallest sample size that can be used with the respective large-sample approximation. The statistics reviewed include single-population tests, comparisons of two populations, comparisons of several populations, and tests of association.
A pair of asymmetric coefficients appropriate for measuring the association in ordered contingency tables is introduced. Earlier statistical methods are reviewed, and the new coefficients are shown to be closely related to both Kendall's tau-b and to Goodman and Kruskal's gamma, as well as the commonly used "percentage-difference." They are also shown to have an operational interpretation (in the Goodman-Kruskal sense). Their utility in both square and non-square contingency tables is discussed, as well as the way they may be interpreted as ordinal analogues of the traditional regression coefficients.
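The coefficients described are close relatives of Somers' d. As a generic illustration (not the authors' exact definitions), the sketch below counts concordant, discordant, and tied pairs and returns Goodman and Kruskal's gamma together with an asymmetric, Somers'-d-type ratio in which pairs tied only on the dependent variable enter the denominator.

```python
import numpy as np

def ordinal_association(x, y):
    """Concordance-based association for paired ordinal data:
    gamma = (C - D) / (C + D) and the asymmetric d_yx = (C - D) / (C + D + Ty),
    where Ty counts pairs tied on y but not on x."""
    x, y = np.asarray(x), np.asarray(y)
    C = D = Ty = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = np.sign(x[i] - x[j]), np.sign(y[i] - y[j])
            if dx != 0 and dy != 0:
                if dx == dy:
                    C += 1
                else:
                    D += 1
            elif dx != 0 and dy == 0:
                Ty += 1
    gamma = (C - D) / (C + D) if C + D else 0.0
    d_yx = (C - D) / (C + D + Ty) if C + D + Ty else 0.0
    return gamma, d_yx
```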
Typically two independent groups are compared in terms of some measure of location, usually the mean, or a method based on ranks. A concern about both of these approaches, already raised in statistical references, is that they can miss important differences. For example, a new treatment method might be beneficial for some subjects but detrimental for others. Details are given in this paper. An approach to this problem is to compare two groups in terms of multiple quantiles. The paper describes and illustrates a method for accomplishing this.
This paper presents techniques for use in meta-analytic research to estimate effect sizes when studies involve complex ANOVA, ANOVA with repeated measures, and complex ANOVA with repeated measures. Real examples are also provided to show the application of the techniques.
The Type I comparisonwise and experimentwise error rates were estimated empirically using Monte Carlo procedures for five multiple comparison procedures used for pairwise comparisons between means. The rates were shown to differ between the two cases: (i) carrying out the procedure only if the treatment F test is significant, and (ii) carrying it out whether or not the F test is significant.
This essay presents a variation on a theme from my article "The use of tests of statistical significance", which appeared in the Spring, 1999, issue of Mid-Western Educational Researcher.
Fisher (1962) has proposed a method for obtaining confidence limits for the cross-product ratio in a 2 × 2 table which requires that the user solve a quartic equation. Based on a different mathematical derivation, Cornfield (1956) had proposed a similar method. In the present article, we shall present simpler methods for obtaining confidence limits for the cross-product ratio in a 2 × 2 table, and we shall extend these methods to obtain simultaneous confidence intervals for the r(r − 1)c(c − 1)/4 cross-product ratios in an r × c table (or for a subset of them) and also for the relative differences between the corresponding cross-product ratios in K different r × c tables. In addition, we shall present a modification of a method suggested by Gart (1962a) for the 2 × 2 table, and we shall extend the modified method to the r × c table. The methods presented herein are easier to apply than those given in the earlier literature. For the 2 × 2 table, the confidence limits presented herein are asymptotically equivalent to the limits given earlier. When max (r, c, K) > 2, the simultaneous confidence intervals presented herein for the cross-product ratios and for the relative differences between cross-product ratios are asymptotically shorter than the corresponding intervals given by Cornfield (1956), for the usual probability levels. The method proposed herein for studying the relative differences between cross-product ratios in K r × c tables can also be used to supplement the earlier methods of analysis given by Plackett (1962) and Goodman (1963b) for testing the null hypothesis that these differences are all nil.
This paper concerns a new Normal approximation to the beta distribution and its relatives, in particular, the binomial, Pascal, negative binomial, F, t, Poisson, gamma, and chi square distributions. The approximate Normal deviates are expressible in terms of algebraic functions and logarithms, but for desk calculation it is preferable in most cases to use an equivalent expression in terms of a function specially tabulated here. Graphs of the error are provided. They show that the approximation is good even in the extreme tails except for beta distributions which are J or U shaped or nearly so, and they permit correction to obtain still more accuracy. For use beyond the range of the graphs, some standard recursive relations and some classical continued fractions are listed, with some properties of the latter which seem to be partly new. Various Normal approximations are compared, with further graphs. The new approximation appears far more accurate than the others. Everything an ordinary user of the approximation might want to know is included in this paper. The theory behind the approximation and most proofs are postponed to a second paper immediately following this one.
Summary
The end points of a confidence interval for the noncentrality parameter of noncentral χ² or F are identified as upper and lower percentage points of a distribution. It is shown how these can be calculated by applying the inverse Cornish–Fisher expansion.
The same procedure can be used to find the value of the noncentrality needed to give a fixed size χ ² or F test some specified power; however, in this situation the results may be only approximate.
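A numerical route to the same endpoints, pivoting the noncentral χ² distribution function rather than using the inverse Cornish–Fisher expansion described in the paper, assuming SciPy:

```python
from scipy import stats
from scipy.optimize import brentq

def ncx2_noncentrality_ci(x_obs, df, conf=0.95):
    """Confidence limits for the noncentrality parameter of a noncentral
    chi-square, found by locating the lambda values at which the observed
    statistic sits at the upper and lower tail percentage points."""
    alpha = 1 - conf
    hi_bracket = 10 * x_obs + 100          # crude upper search bound

    def cdf(lam):
        return stats.ncx2.cdf(x_obs, df, lam)

    # lower limit: lambda with P(X <= x_obs) = 1 - alpha/2 (0 if unattainable)
    lo = 0.0 if stats.chi2.cdf(x_obs, df) < 1 - alpha / 2 else \
        brentq(lambda lam: cdf(lam) - (1 - alpha / 2), 0, hi_bracket)
    # upper limit: lambda with P(X <= x_obs) = alpha/2 (0 if unattainable)
    hi = 0.0 if stats.chi2.cdf(x_obs, df) < alpha / 2 else \
        brentq(lambda lam: cdf(lam) - alpha / 2, 0, hi_bracket)
    return lo, hi
```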
Contemporary research on sex differences in intellectual abilities has focused on male-female differences in average performance, implicitly assuming homogeneity of variance. To examine the validity of that assumption, this article examined sex differences in variability on the national norms of several standardized test batteries. Males were consistently more variable than females in quantitative reasoning, spatial visualization, spelling, and general knowledge. Because these sex differences in variability were coupled with corresponding sex differences in means, it was demonstrated that sex differences in variability and sex differences in central tendency have to be considered together to form correct conclusions about the magnitude of cognitive gender differences.
The binomial effect size display (BESD) has been proposed by Rosenthal and Rubin (1979, 1982; Rosenthal, 1990; Rosenthal & Rosnow, 1991) as a format for presenting effect sizes associated with certain experimental and nonexperimental research. An evaluation of the BESD suggests that its application is limited to presenting the results of 2 × 2 tables where φ is employed as the index of effect size. Findings indicate that the BESD provides little added information beyond an examination of the raw percentages in the 2 × 2 table and dramatically distorts effect sizes when binomial success rates vary from .50.
Most textbooks explain how to compute confidence intervals for means, correlation coefficients, and other statistics using “central” test distributions (e.g., t, F) that are appropriate for such statistics. However, few textbooks explain how to use “noncentral” test distributions (e.g., noncentral t, noncentral F) to evaluate power or to compute confidence intervals for effect sizes. This article illustrates the computation of confidence intervals for effect sizes for some ANOVA applications; the use of intervals invoking noncentral distributions is made practical by newer software. Greater emphasis on both effect sizes and confidence intervals was recommended by the APA Task Force on Statistical Inference and is consistent with the editorial policies of the 17 journals that now explicitly require effect size reporting.
A simulation study was conducted to examine probabilities of Type I errors of the two-sample Student t test, the Wilcoxon–Mann–Whitney test, and the Welch separate-variances t test under violation of homogeneity of variance. Two-stage procedures in which the choice of a significance test in the second stage is determined by the outcome of a preliminary test of equality of variances in the first stage were also examined. Type I error rates of both the t test and the Wilcoxon test were severely biased by unequal population variances combined with unequal sample sizes. The two-stage procedures were not only ineffective, they actually distorted the significance level of the test of location. Furthermore, the distortion was greatest when the discrepancy between variances was slight rather than extreme. Unconditional substitution of the Welch separate-variances t test for the Student t test whenever sample sizes were unequal was the most effective way to counteract modification of the significance level. Conditional substitution of the Welch test, depending on the outcome of a preliminary test, was far less effective.
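The unconditional-substitution recommendation is straightforward to follow in practice; with SciPy one simply requests the separate-variances form of the test. Synthetic data below, with the smaller group given the larger variance, the pairing that distorts the pooled test the most:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(loc=0.0, scale=3.0, size=12)   # small group, large variance
b = rng.normal(loc=0.0, scale=1.0, size=40)   # large group, small variance

welch = stats.ttest_ind(a, b, equal_var=False)   # separate-variances (Welch)
student = stats.ttest_ind(a, b, equal_var=True)  # pooled-variance Student t
print(f"Welch p = {welch.pvalue:.3f}, Student p = {student.pvalue:.3f}")
```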
The boxplot has proven to be a very useful tool for summarizing univariate data. Several options of bivariate boxplot-type constructions are discussed. These include both elliptic and asymmetric plots. An inner region contains 50% of the data, and a fence identifies potential outliers. Such a robust plot shows location, scale, correlation, and a resistant regression line. Alternative constructions are compared in terms of efficiency of the relevant parameters. Additional properties are given and recommendations made. Emphasis is given to the bivariate biweight M estimator. Several practical examples illustrate that standard least squares ellipsoids can give graphically misleading summaries.
The Wilcoxon–Mann–Whitney test enjoys great popularity among scientists comparing two groups of observations, especially when measurements made on a continuous scale are non-normally distributed. Triggered by different results for the procedure from two statistics programs, we compared the outcomes from 11 PC-based statistics packages. The findings were that the delivered p values ranged from significant to nonsignificant at the 5% level, depending on whether a large-sample approximation or an exact permutation form of the test was used and, in the former case, whether or not a correction for continuity was used and whether or not a correction for ties was made. Some packages also produced pseudo-exact p values, based on the null distribution under the assumption of no ties. A further crucial point is that the variant of the algorithm used for computation by the packages is rarely indicated in the output or documented in the Help facility and the manuals. We conclude that the only accurate form of the Wilcoxon–Mann–Whitney procedure is one in which the exact permutation null distribution is compiled for the actual data.
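With SciPy's stats.mannwhitneyu the variants discussed above can be requested explicitly, which makes it easy to see how the reported p value depends on the choice (tie-free toy data, so the exact permutation distribution is valid):

```python
import numpy as np
from scipy import stats

x = np.array([1.1, 2.3, 2.9, 4.2, 5.5, 6.1])
y = np.array([0.8, 1.9, 2.0, 3.1, 3.4, 3.8])

exact = stats.mannwhitneyu(x, y, method="exact")
asym_cc = stats.mannwhitneyu(x, y, method="asymptotic", use_continuity=True)
asym_nocc = stats.mannwhitneyu(x, y, method="asymptotic", use_continuity=False)

print(f"exact permutation p      = {exact.pvalue:.4f}")
print(f"normal approx. (with cc) = {asym_cc.pvalue:.4f}")
print(f"normal approx. (no cc)   = {asym_nocc.pvalue:.4f}")
```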
A simulation study was done to compare seven confidence interval methods, based on the normal approximation, for the difference of two binomial probabilities. Cases considered included minimum expected cell sizes ranging from 2 to 15 and smallest group sizes (NMIN) ranging from 6 to 100. Our recommendation is to use a continuity correction of 1/(2 NMIN) combined with the use of (N − 1) rather than N in the estimate of the standard error. For all of the cases considered with minimum expected cell size of at least 3, this method gave coverage probabilities close to or greater than the nominal 90% and 95%. The Yates method is also acceptable, but it is slightly more conservative. At the other extreme, the usual method (with no continuity correction) does not provide adequate coverage even at the larger sample sizes. For the 99% intervals, our recommended method and the Yates correction performed equally well and are reasonable for minimum expected cell sizes of at least 5. None of the methods performed consistently well for a minimum expected cell size of 2.
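My reading of that recipe as code (a sketch, not the authors' program): use (n − 1) in the standard error and add a continuity correction of 1/(2·NMIN) to the half-width.

```python
import math

def diff_prop_ci(x1, n1, x2, n2, z=1.96):
    """Normal-approximation CI for p1 - p2 with (n - 1) in the standard error
    and a 1/(2*NMIN) continuity correction, following the recommendation above."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / (n1 - 1) + p2 * (1 - p2) / (n2 - 1))
    half = z * se + 1.0 / (2 * min(n1, n2))
    return (p1 - p2) - half, (p1 - p2) + half
```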
Choice between two treatments, A and B, is sometimes based on the probability that A will be more effective (score higher, say) than B. Ideally, to estimate this probability a sample of subjects would receive both A and B and the proportion of (A − B) differences which are positive would be used as the estimate. Often, however, both treatments cannot be given to each subject, and inference is based on a trial using two independent samples. Unfortunately, probability structures exist for which P(A − B > 0) for two independent samples is not equal to P(A − B > 0) for matched samples. The two-independent-sample Wilcoxon test statistic addresses the former probability and hence cannot be used to answer the question, “Is the probability that A will do better than B greater than 1/2?” unless further assumptions are made.
The equivalence of some tests of hypothesis and confidence limits is well known. When, however, the confidence limits are computed only after rejection of a null hypothesis, the usual unconditional confidence limits are no longer valid. This refers to a strict two-stage inference procedure: first test the hypothesis of interest and if the test results in a rejection decision, then proceed with estimating the relevant parameter. Under such a situation, confidence limits should be computed conditionally on the specified outcome of the test under which estimation proceeds. Conditional confidence sets will be longer than unconditional confidence sets and may even contain values of the parameter previously rejected by the test of hypothesis. Conditional confidence limits for the mean of a normal population with known variance are used to illustrate these results. In many applications, these results indicate that conditional estimation is probably not good practice.
Using exact unconditional distributions, we evaluated the size and power of four exact test procedures and three versions of the χ² statistic for the two-sample binomial problem in small-to-moderate sample sizes. The exact unconditional test (Suissa and Shuster 1985) and Fisher's (1935) exact test are the only tests whose size can be guaranteed never to exceed the nominal size. Though the former is distinctly more powerful, it is also computationally difficult. We propose an alternative that approximates the exact unconditional test by computing the exact distribution of the χ² statistic at a single point, the maximum likelihood estimate of the common success probability. This test is a modification of the test of Liddell (1978), which considered the exact distribution of the difference in the sample proportions. Our test is generally more powerful than either the exact unconditional or Liddell's test, and its true size rarely exceeds the nominal size. The uncorrected χ² statistic is frequently anticonservative, but the magnitude of the excess in size is usually moderate. Though this point has been somewhat controversial for many years, we endorse the view that one should not use Fisher's exact test or Yates's continuity correction in the usual unconditional sampling setting.
Inference procedures are proposed for any specified function of two random variables f(X, Y), assuming that independent random samples from the X and Y populations are available. A generalization of the Mann-Whitney statistic is used to obtain point and interval estimates for the probability that f(X, Y) falls in a given interval. One proposed confidence interval is a modification and improvement to the approach of Halperin, Gilbert, and Lachin (1987) for estimating Pr(X < Y). A thorough review is made of the literature on nonparametric confidence bounds for Pr(X < Y), with emphasis on methods based on the central limit theorem. In addition to the improvement of Halperin et al.'s method, bounds analogous to the Clopper-Pearson confidence interval for a binomial parameter are proposed. Simulation results show the performance of the alternative procedures. Approximate nonparametric tolerance limits for f(X, Y) are also proposed. The article closes by constructing a two-sided tolerance interval for glucose levels in human serum standard reference material. In this application, f(X, Y) = XY. Previous methods for constructing tolerance intervals for the product of two random variables required that X and Y be normally distributed.
The level of ordinary two-sample procedures is not preserved if the two populations differ in dispersion or shape. The effect of such differences, especially differences in dispersion, on the t, median, Mann-Whitney, and normal scores procedures is investigated asymptotically, and tables are given comparing the four procedures.
It is often interesting to compare the size of treatment effects in analysis of variance designs. Many researchers, however, draw the conclusion that one independent variable has more impact than another without testing the null hypothesis that their impact is equal. Most often, investigators compute the proportion of variance each factor accounts for and infer population characteristics from these values. Because such analyses are based on descriptive rather than inferential statistics, they never justify the conclusion that one factor has more impact than the other. This paper presents a novel technique for testing the relative magnitude of effects. It is recommended that researchers interested in comparing effect sizes apply this technique rather than basing their conclusions solely on descriptive statistics.
In this paper, we examine the issue of testing for superiority or inferiority following the equivalence conclusion. We contrast this situation with that of testing for superiority after a noninferiority conclusion. We point out the fact that further testing is irrelevant in situations where the equivalence definition renders the inferred true treatment effect difference nonsignificant from the clinical perspective. Consequently, superiority in the presence of an accompanying equivalence definition has a meaning that is different from its traditional interpretation.